DOCUMENT RESUME

ED 072 826                                        LI 004 180

AUTHOR         Lay, William Michael
TITLE          The Double-KWIC Coordinate Indexing Technique:
               Theory, Design, and Implementation.
INSTITUTION    Ohio State Univ., Columbus. Computer and Information
               Science Research Center.
SPONS AGENCY   National Science Foundation, Washington, D.C. Office
               of Science Information Services.
REPORT NO      OSU-CISRC-TR-73-1
PUB DATE       Feb 73
NOTE           263p.; (0 References); Dissertation
EDRS PRICE     MF-$0.65 HC-$9.87
DESCRIPTORS    *Automatic Indexing; *Coordinate Indexes; *Indexes
               (Locaters); *Indexing; *Information Retrieval;
               Relevance (Information Retrieval)
IDENTIFIERS    DKWIC; Double KWIC Coordinate Index; *Key Word in
               Context; KWIC

ABSTRACT
The development of an automatic indexing technique,
called Double KWIC (DKWIC) Coordinate Indexing, is described which
extends the KWIC indexing principles to provide easy access to an
additional level of specificity for information indexed under these
frequently appearing terms. Chapter 2 discusses indexing terminology
and some fundamental relationships between indexing and document
retrieval. Chapter 3 sketches a brief history of automated indexes
describing frequently encountered methods of construction and
display. Chapter 4 introduces the Double-KWIC Coordinate Indexing
scheme and discusses its advantages and disadvantages relative to
several other KWIC indexing schemes. Chapter 5 discusses refinements
in the prototype indexing scheme which led to the production of
KWOC-DKWIC hybrid indexes. Chapter 6 considers the problems of
vocabulary control in a natural language environment. Several methods
of automated vocabulary normalization are described. Chapter 7
examines the role played by the index analyst in creating a
Double-KWIC Coordinate Index and resolves the plaguing problem of
main term selection by an automatic selection algorithm which can
only be applied successfully with KWIC-DKWIC hybrid indexes. The
final chapter examines the parametric controls of the KWIC-DKWIC
indexing scheme and discusses some relationships among these
parameters and the indexes produced. (Author/NH)






U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
OFFICE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY
REPRESENT OFFICIAL OFFICE OF EDUCATION
POSITION OR POLICY.



(OSU-CISRC-TR-73-1) 



THE DOUBLE-KWIC COORDINATE INDEXING TECHNIQUE: 
THEORY, DESIGN, AND IMPLEMENTATION 
by

William Michael Lay 




Work performed under 
Grant No. 534.1, National Science Foundation 






Computer and Information Science Research Center 
The Ohio State University 
Columbus, Ohio 43210 
February 1973 



PREFACE 



This work was done in partial fulfillment of the requirements for 
a doctor of philosophy degree in Computer and Information Science from 
The Ohio State University. It was supported in part by Grant No. GN
534.1 from the Office of Science Information Service, National Science 
Foundation, to the Computer and Information Science Research Center of 
The Ohio State University. 

The Computer and Information Science Research Center of The Ohio 
State University is an interdisciplinary research organization which 
consists of the staff, graduate students, and faculty of many University
departments and laboratories. This report is based on research accom- 
plished in cooperation with the Department of Computer and Information 
Science. 

The research was administered and monitored by The Ohio State Uni- 
versity Research Foundation. 






ACKNOWLEDGMENTS

I would like to express my appreciation to the many people who 
contributed to the successful completion of this work. 

I am indebted to Professor Anthony Petrarca, my advisor, who
initiated this investigation and whose valuable assistance and occasional
prodding immeasurably aided the progress and fruition of this work. I
am very grateful to Professors James Rush and Lee White for serving
as members of the committee who read this dissertation.

I am appreciative of Professor William Atchison who allowed my
continuance of this work while I was teaching at the University of
Maryland and to Mr. Robert Jones of the Health Sciences Computer Center of
the University of Maryland who allowed me to use the HSCC computing
facilities to test some of the programs designed and to produce this
document.

Partial support of this work has been provided by a grant
(GN-534.1) from the National Science Foundation to the Computer and
Information Science Research Center, by the Ohio State University Instruction
and Research Computer Center who donated much of the computer time, and
through a Title II-B Fellowship in Library and Information Science
awarded by the Office of Education.

Finally, I would like to express my gratitude to my wife, Carolyn, 
who endured the years I spent as a graduate student lending hardy moral 
and sometimes physical support to this work. 






TABLE OF CONTENTS

PREFACE

ACKNOWLEDGMENTS

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER

I. Introduction: The Need for Better Indexing Practice

II. Indexing Terminology and Some Fundamental
    Relationships Between Indexing and Document Retrieval

III. Automated Indexing: A Brief History

    1 Computer-Compiled Indexes
        1.1 Rotated Keyword Index
        1.2 Completely Permuted Keyword Index
        1.3 Selected-Listing-In-Combination (SLIC) Index
        1.4 PERMUTERM Index

    2 Computer-Generated Indexes
        2.1 Key-Word-In-Context (KWIC) Index and
            Key-Word-Out-of-Context (KWOC) Index
        2.2 PANDEX Index
        2.3 Articulated Subject Index

    3 Approach Explored in This Thesis






IV. The Prototype Double-KWIC (DKWIC) Coordinate Index

    1 Construction of the Double-KWIC Coordinate Index

    2 Utility of the Double-KWIC Coordinate Index

    3 Stoplists for the Prototype Double-KWIC
      Coordinate Index

    4 Advantages and Disadvantages of the DKWIC
      Indexing Technique

    5 Prototype System Design

V. Evaluation and Modification of the Prototype
   System: The KWOC-DKWIC Hybrid Index

    1 The Modified System Design: Production of
      KWOC-DKWIC Hybrid Indexes

    2 Extraction of Potential Main Terms (PMTs)

    3 Human Interface Requirements for the Selection
      of Actual Main Terms (AMTs) and KWOC-DKWIC
      Threshold Values

    4 Other Features of the KWOC-DKWIC Hybrid System

VI. Vocabulary Control for Natural Language Indexing

    1 Resolving Inflectional Scattering
        1.1 Stemming and Recoding for Printed Indexes
        1.2 Plural-Singular Stemming-Recoding Algorithm

    2 Synonymal Scattering

    3 Are Titles Sufficient?






VII. Evolution of the KWIC-DKWIC Hybrid System for
     Automating AMT Selection in the DKWIC Indexing
     Systems

    1 Magnitude of the Human Interface Requirements
      for the DKWIC Indexing Operations

    2 Examination of the AMT Selection Processes

    3 AMT Selection Algorithms for Minimizing Index
      Size and Cost

    4 Influence of the PMT Generation Process on AMT
      Selection Algorithms
        4.1 A Process for Generating Exclusive PSE
            (Potential Subordinate Entry) Sets
        4.2 Maximal Main Terms (MMTs) and Specificity
            Units

    5 An AMT Selection Algorithm

    6 Automating the AMT Selection Process

    7 Automatic AMT Selection Failures and Their
      Remedies: The KWIC-DKWIC Hybrid Index

    8 Implementation of Automated AMT Selection
      in KWIC-DKWIC Hybrid Indexes
        8.1 Generation of Maximal Main Terms
        8.2 Selection of Actual Main Terms
        8.3 Generation of AMTs from the MMT File and
            AMT Marker File
        8.4 Actual Subordinate Entry (ASE)
            Construction
        8.5 Printing the KWIC-DKWIC Hybrid Index

VIII. Results, Conclusions, and Directions for Future
      Research

    1 Influence of Various Parameters on Characteristics
      of the Index, and Supporting Experimental
      Evidence






    2 Future Research and Possible Improvements in
      the DKWIC Indexing Technique
        2.1 Actual Subordinate Entry Regulation
        2.2 Automated Generation of "See" and "See
            Also" Cross References
        2.3 Other Possible Index Refining Procedures

    3 Concluding Remarks

APPENDICES

A On Counting Index Entries of an Articulated
  Subject Index

B On Estimating the Number of Entries of a
  KWIC-DKWIC Index

C System Installation and Execution Instructions
  for the Double-KWIC Coordinate Index Subsystems

    1 Form of the Distributed Indexing Subsystems

    2 Job Control Installation and Execution Aids

    3 Installing the DKWIC Indexing Subsystems

    4 The KWOC-DKWIC Hybrid Index Generator -
      Documentation
        4.1 KWOC-DKWIC Execution Parameters
        4.2 Input of Stoplists to the KWOC-DKWIC
            Index Generator
        4.3 Selecting Actual Main Terms for a
            KWOC-DKWIC Index
        4.4 Job Control for a KWOC-DKWIC Index
            Generation
        4.5 Sample JCL for a KWOC-DKWIC Index
            Generation
        4.6 Messages Issued by the KWOC-DKWIC
            Index Subsystem
        4.7 KWOC-DKWIC Index Subsystem Implementation
            Restrictions




    5 The KWIC-DKWIC Hybrid Index Generator -
      Documentation
        5.1 KWIC-DKWIC Execution Parameters
        5.2 Input of Stoplists to the KWIC-DKWIC
            Index Generator
        5.3 Job Control for a KWIC-DKWIC Index
            Generation
        5.4 Sample JCL for a KWIC-DKWIC Index
            Generation
        5.5 Messages Issued by the KWIC-DKWIC
            Index Subsystem
        5.6 KWIC-DKWIC Index Subsystem Implementation
            Restrictions

    6 The Authority List Generator - Documentation
        6.1 Authority List Execution Parameters
        6.2 Authority List Exceptions List Input
        6.3 Authority List Format
        6.4 Job Control for the Authority List
            Generator
        6.5 Sample JCL for the Authority List
            Generator
        6.6 Messages Issued by the Authority List
            Generator
        6.7 Authority List Subsystem Implementation
            Restrictions

    7 Interfacing the Data Base
        7.1 Requirements of an Interface Subroutine
        7.2 Chemical Titles Interface Subroutine

    8 Word Finder Subroutine

BIBLIOGRAPHY

GLOSSARY

INDEX




LIST OF FIGURES

3.1 A portion of a SLIC index

3.2 A portion of a PERMUTERM index

3.3 A portion of a KWIC index

3.4 A portion of a KWOC index

3.5 A portion of a PANDEX index

3.6 A portion of an articulated subject index

3.7 All articulated index phrases generated from the
    title "Articulation in Indexes for Books on
    Science"

4.1 A portion of a conventional KWIC index illustrating
    the randomization of secondary concepts found
    for a high-density keyword

4.2 A variant form of a KWIC (also called KWOC) index
    illustrating complete randomization of secondary
    concepts for the same titles illustrated in
    Figure 4.1

4.3 Another KWOC format illustrating complete randomization
    of secondary concepts for the high-density
    concepts of Figure 4.1

4.4 A PANDEX index for the same titles of Figure 4.1
    illustrating partial ordering of a single
    secondary concept for each title where the
    secondary concept chosen is not always the most
    appropriate one

4.5 Construction of the prototype Double-KWIC (DKWIC)
    coordinate index entries

4.6 Annotated description of the display format for the
    prototype Double-KWIC coordinate index derived
    from titles in Journal of Chemical Documentation,
    volume 7




4.7 DKWIC index entries for the same high-density term
    of Figure illustrating ordered access to all
    secondary concepts represented by significant
    words in the titles

4.8 Illustration of a two-word main term which provides
    immediate access to more specific concepts

4.9 A three-word main term of a DKWIC index

4.10 System design for creating the prototype DKWIC
     index

5.1 Size-ballooning effect in the prototype DKWIC index
    caused by permuting subordinate entries under
    main terms derived from only a single title

5.2 Stuttering effect and size-ballooning effect in the
    prototype DKWIC index caused by permuted subordinate
    entries for a main term which appears more
    than once in a title

5.3 Annotated description of the construction of index
    terms for the KWOC-DKWIC hybrid index

5.4 System design for creating the KWOC-DKWIC hybrid
    index

5.5 Illustration of effect of word delimiters and
    selection criteria on generation of potential main
    terms and potential index entries from a title

5.6 A portion of a PMT list and occurrence frequency
    data used for selection of actual main terms

5.7 Example of two types of subordinate entries found
    in a KWOC-DKWIC hybrid index

6.1 Inflectional scattering in a KWIC index

6.2 A portion of the prototype DKWIC index illustrating
    scattering due to the occurrence of singular
    and plural word forms

6.3 A portion of an automatically generated authority
    list produced by the plural-singular stemming-
    recoding algorithm




6.4 Reduced scattering in a DKWIC index as a result of
    applying an automatically-generated authority
    list to words of main terms

6.5 Synonymal pointers found in a KWIC index as "see
    also" cross references

6.6 Vocabulary normalization in a PANDEX index collating
    preferred words but not altering the original
    text

7.1 A potential main term group consisting of all PMTs
    which begin with the same word

7.2 An AMT tree chosen from the PMT group of Figure 7.1

7.3 The PMT tree for the PMT group of Figure 7.1 showing
    values for total PSE sets (P) and exclusive PSE
    sets (Z) for all the nodes

7.4 Terminal PMT statistics, Z<t>, for the PMT group of
    Figure 7.1

7.5 The specificity units generated from a title

7.6 The maximal main terms formed from the specificity
    units illustrated in Figure 7.5

7.7 The selection override commands necessary to form
    the AMT selections illustrated in Figure 7.2
    from the MMT group in Figure 7.4

7.8 The logical flow for an automated main term
    selection process

7.9 A trace of automated main term selections for the
    PMT tree of Figure 7.3

7.10 A summary of automatic main term selections
     performed on the PMT tree of Figure 7.3

7.11 Display format for the KWIC-DKWIC hybrid index

7.12 The system design for creating KWIC-DKWIC hybrid
     indexes with automatic AMT selection

7.13 Flowchart describing maximal main term generation


7.14 An illustration of the linearized PMT tree format
     for the MMT group illustrated in Figure 7.4

7.15 Flowchart describing the construction of a PMT
     tree from an MMT group

7.16 Flowchart describing the AMT selection process

7.17 The formats of the actual main term and the
     exclusive PSE markers produced by the AMT
     selection algorithm

7.18 An illustration of the AMT and exclusive PSE count
     markers automatically produced by the AMT selection
     algorithm from the MMT group of Figure 7.4

7.19 Flowchart describing the tailoring of records
     to form actual main terms

7.20 Flowchart describing the generation of ASEs

7.21 Flowchart describing the printing of the final
     index

8.1 A graph illustrating influence of minimum posting
    threshold, maximum posting threshold, permutation
    threshold, and word occurrence frequency on the
    selection of AMTs

8.2 Some general statistics concerning an index
    generation

8.3 Subordinate terms generated by applying some word-
    proximity restrictions to ASE selection

8.4 An illustration of a "see" cross reference and the
    enriched title from which the reference was
    generated

8.5 An example of structural scattering that occurs in
    double-KWIC coordinate indexes due to the
    syntactic structure of natural language






LIST OF TABLES

8.1 A comparison of the number of main terms generated
    at a particular specificity as posting limits
    are varied

8.2 Index size and the percent DKWIC-type entries for
    indexes prepared from the same titles with
    various posting thresholds






CHAPTER I. INTRODUCTION: THE NEED FOR BETTER INDEXING
PRACTICE

"...unless this mass (of information) be properly
arranged and the means furnished by which its
contents may be ascertained, literature and
science will be overwhelmed by their own unwieldy
bulk."

Annual Report of the Smithsonian
Institution for 1851

For more than a century this warning, given explicitly
by Joseph Henry, Secretary of the Smithsonian Institution in
1851, went unheeded. He foresaw a potentially insurmountable
barrier of literature when the total increment to man's
published works was estimated at 20,000 volumes annually.
Henry's statement was ignored, as were others issued from
time to time by those who saw the impending danger buried
beneath the accumulating bulk of literature.

The inevitable explosion, accompanied by a frantic call
for control, came during the boom following World War II.
The world's research effort, stimulated by a war-time
environment, produced a new flood of literature so great
that the existing methods of information dissemination could
no longer be considered adequate. Simultaneously, such a
realization was evolving within the scientific community.
Research could be increasingly stimulated by an intelligent
insight into what had gone before or what had been reported
in the literature. It was ironic that the recognition of
the failure of traditional dissemination techniques should
accompany man's greatest need for information control!

Not until that time did man finally acknowledge that
the traditional library tools were not only inadequate but
actually limiting his ability to cope with the many new
problems that faced him. He required highly specialized
information currently being spawned by the scientific
community as well as those past explorations buried deep
beneath the "unwieldy bulk." He was thwarted by the
necessary time lag of traditional techniques and severely
restricted by the conventional indexing schemes. He was
frustrated by:

a) the physical impossibility of his reading and
remembering all of the literature that could have a
reasonable probability of being of interest at some
unspecified future time;

b) the economic impossibility that he could process a
major part of the literature for later exploitation
that exhibited probable interest;

c) the mechanical impossibility that the currently
employed literary procedures could effectively cope
with his highly specialized requests.

Dr. Vannevar Bush in a report to the President and
later in an often quoted paper {Bush,45} focused attention
on a most critical deficiency in traditional library
practices:

"...The difficulty seems to be, not so much that
we publish unduly in view of the extent and
variety of present day interests, but rather that
publication has been extended far beyond our
present ability to make real use of the record.
The summation of human experience is being
expanded at a prodigious rate, and the means we
use for threading through the consequent maze to
the momentarily important item is the same as was
used in the days of square-rigged ships... The
real heart of the matter of selection, however,
goes deeper than a lag in the adoption of
mechanisms by libraries, or a lack of development
of devices for their use. Our ineptitude in
getting at the record is largely caused by the
artificiality of the systems of indexing..."

The overwhelming need for literature retrieval combined
with Bush's observations on traditional indexing methods
prompted many researchers to directly attack the problems
frustrating the library users. The advent of electronic
machines used to manipulate non-numeric data spurred the
development of mechanized approaches to indexing and library
management.

Considering Bush's comments, the study of these types
of problems should more aptly be entitled "information
storage for retrieval." The literature still abounds with
data, conclusions, opinions, and theories applicable to a
host of fields and recorded in journals, reports,
proceedings, and theses too numerous to comprehensively
list.

The need for high-quality printed indexes has not
diminished despite the recent strides in automatic
information retrieval systems. Since the application of
key-word-in-context (KWIC) indexing (and key-word-out-of-
context (KWOC) indexing) as an automated derivative indexing
technique {Luhn,59}, the KWIC index has been used widely but
not without some dissatisfaction with its quality as a
retrieval tool {Fischer,66}. Most attempts to improve its
quality have dealt with variations in format to improve
readability, or with enrichment terms to provide additional
index entries which otherwise would not have been derived
from the words in the titles. Neither of these
modifications improves the quality of the index when an index
term appears frequently in the title phrases indexed. In
this case, index terms form large blocks of index entries
where access to more specific concepts is hindered by the
random scattering of secondary concepts in each index
phrase. The user must scan the context about each term in
the block in order to determine that subset of entries which
is pertinent to a more specific search.
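
The blocking effect described above can be illustrated with a small
sketch. It is not the program described in this thesis; the stoplist,
the sample titles, and the function names are invented for the
example, and the rotation format (a "/" marking the original start of
the title) is only one of several display conventions.

```python
# Minimal KWIC sketch (illustrative only; stoplist and titles are invented).
STOPLIST = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

TITLES = [
    "Automatic indexing of chemical titles",
    "Indexing and retrieval of patents",
    "Cost of manual indexing",
]


def kwic_entries(titles):
    """Return (keyword, rotated_title, title_number) tuples in index order."""
    entries = []
    for number, title in enumerate(titles, start=1):
        words = title.split()
        for position, word in enumerate(words):
            keyword = word.lower()
            if keyword in STOPLIST:
                continue  # stoplisted words never become index terms
            # Rotate the title so the keyword leads; "/" marks the original start.
            rotated = " ".join(words[position:] + ["/"] + words[:position])
            entries.append((keyword, rotated, number))
    # Order by keyword, then by the context that follows the keyword.
    return sorted(entries, key=lambda entry: (entry[0], entry[1].lower()))


def block_for(keyword, titles=TITLES):
    """All rotated titles filed under one keyword (one block of the index)."""
    return [rotated for key, rotated, _ in kwic_entries(titles) if key == keyword]


for line in block_for("indexing"):
    print(line)
```

In the block printed for "indexing", the secondary concepts (cost,
retrieval, patents, chemical) fall wherever the surrounding words
happen to place them, so a reader interested in one of them must scan
every line of the block.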

This thesis describes the development of an automatic
indexing technique, called Double-KWIC (DKWIC) Coordinate
Indexing, which extends the KWIC indexing principles to
provide easy access to an additional level of specificity
for information indexed under these frequently appearing
terms. Chapter 2 discusses indexing terminology and some
fundamental relationships between indexing and document
retrieval important to the chapters that follow. Chapter 3
sketches a brief history of automated indexes describing
frequently encountered methods of construction and display.
Chapter 4 introduces the Double-KWIC Coordinate Indexing
scheme and discusses its advantages and disadvantages
relative to several other indexing schemes based on KWIC
indexing principles. Chapter 5 discusses refinements in the
prototype indexing scheme which led to the production of
KWOC-DKWIC hybrid indexes. Chapter 6 considers the problems
of vocabulary control in a natural language environment.
Several methods of automated vocabulary normalization are
described which provide a basis for an effective automated
solution to some scattering problems in printed indexes.
Chapter 7 examines the role played by the index analyst in
creating a Double-KWIC Coordinate Index and resolves the
plaguing problem of main term selection by an automatic
selection algorithm which can only be applied successfully
with KWIC-DKWIC hybrid indexes. The final chapter examines
the parametric controls of the KWIC-DKWIC indexing scheme
and discusses some relationships among these parameters and
the indexes produced. Some concluding remarks spell out
areas where this indexing method can be modified further to
supply even more useful indexes. Appendix C of this thesis
acts as a documentation guide to the computer programs
written to generate KWOC-DKWIC and KWIC-DKWIC indexes, with
or without vocabulary control. A KWIC-DKWIC index of this
document, prepared from the phrases appearing in the Table of
Contents, List of Tables, and List of Figures, serves not
only as an example of the indexing system described in this
thesis but also as an index to important topics of the
thesis.



CHAPTER II. INDEXING TERMINOLOGY AND SOME FUNDAMENTAL
RELATIONSHIPS BETWEEN INDEXING AND DOCUMENT
RETRIEVAL



Since this thesis deals with the automatic construction
of useful indexes to collections of documents, a few
definitions and relationships appropriate to the general
topics of indexing and document retrieval are presented in
this chapter. A document is an identifiable collection of
concepts which can be considered as a single unit. A
journal or journal article, a chapter of a book, a paragraph
of a chapter, or an entire book can be considered as a
document. A document may be something other than
conventional printed matter, such as a file recorded on
magnetic tape or a motion picture film. In general, a
document will assume three attributes: a title, a body, and
an accession code. A title is a condensed description of
the contents of the document body and usually consists of
several phrases composed of high-content words. The body of
a document contains a discussion of the relationships
existing among the concepts described therein, while an
accession code is a coded identifier of the document.

An index is a document consisting of an ordered set of
index entries. Each index entry describes, via an index
term, a subset of the concepts found in an identifiable
class of documents and contains a means of locating this
class of documents. For example, an index commonly found in
the back of most books consists of index entries listed
alphabetically (an ordering) on the basis of the important
topics (index terms) discussed in the text. The documents
in which these concepts are described are identified and
located by page number (accession code). Here, a document
is equivalent to a page and the class of documents
identified by the index entry consists of a list of page
numbers. A single index entry rarely provides information
concerning every concept described in the document it
identifies, as the example above implies. Consequently, the
topic discussed on the pages noted in an index entry may be
one of many discussed within the body of the indicated page.
In this example, it was assumed that the page numbers listed
in the index entry referred to pages of the text containing
the index. This may seem to be a trivial point, but its
importance becomes more apparent when large collections of
documents are to be indexed.
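
The back-of-book example can be made concrete with a short sketch.
The page contents and the name build_page_index are hypothetical,
chosen only to show an ordered set of index entries, each pairing an
index term with the accession codes (page numbers) of the documents
(pages) it locates.

```python
from collections import defaultdict

# Hypothetical page contents: page number (accession code) -> topics on that page.
PAGES = {
    12: ["stoplist", "title"],
    13: ["title"],
    40: ["accession code", "stoplist"],
}


def build_page_index(pages):
    """Return index entries as (index term, sorted page numbers), ordered by term."""
    postings = defaultdict(set)
    for page_number, topics in pages.items():
        for topic in topics:
            postings[topic].add(page_number)
    # The ordering of the index: alphabetical by index term.
    return [(term, sorted(postings[term])) for term in sorted(postings)]


INDEX = build_page_index(PAGES)
for term, page_numbers in INDEX:
    print(term, page_numbers)
```

Note that the entry for "stoplist" says nothing about the other topic
treated on page 12; a single entry describes only a subset of the
concepts of the document it locates.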

The means of locating a document, its accession code,
may be much broader in scope to aid the retriever. For
example, in Chemical Titles and other publications produced
by Chemical Abstracts Service {CAS,72}, documents (journal
articles) are identified by a 17 character field which
includes a coded journal title (ASTM CODEN), its volume and
page number. Libraries employ an accession coding scheme
which reflects the subject matter of the document as well as
its shelf location within the library {see Dewey,65}.
Regardless of its length or usefulness to the retriever, the
accession codes assigned to documents of a collection will
be assumed unique.

It is sometimes convenient to view an index as a
mapping of a document space, D, into an ordered index space, I:

    f: D -> I

The indexing function, f, relates elements of D, documents,
to corresponding elements of I, index entries.
For every document, d, in D there exists a set of
index descriptors generated by applying the indexing
function to the document. Thus,

    set of index descriptors of d<j>
        = f(d<j>) = {i<1>, i<2>, ..., i<n<j>>}<j> *

That is, for each document of D there exists a set of
index descriptors in I which describe the concepts contained
in the document. The number of index entries generated from
the above descriptors, n<j>, is a measure of the identified
(and accessible) concepts of the document d<j>, and is
sometimes referred to as the breadth of indexing. The depth
of indexing refers to the amount of detail about the concept
described by an index entry. The application of the
indexing function to a document producing a set of index
descriptors is called indexing.

* The notation used in the above equation and elsewhere in
this thesis deviates slightly from the notation normally
used because of the limited character set available for
keyboarding of this thesis which was processed and printed
by computer text processing programs. The form of the
notation used for this thesis is summarized in the Glossary.

Similarly, there exists a type of inverse function, g,
which maps the index space into the document space.

    g: I -> D

For each entry in I, there exists a set of document
descriptors generated by the function, g:

    set of document descriptors of i<k>
        = g(i<k>) = {d<1>, d<2>, ..., d<m<k>>}<k>

Therefore, the function, g, relates a subset of the
documents in D having a common concept represented by the
index entry, i<k>. The cardinality of the document descriptor,
m<k>, indicates the number of documents located by the mapping
function, g. The function, g, describes the action of
document retrieval by the generation of document
descriptors. Consequently, g will be referred to as the
retrieving function.
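
As a rough illustration of the two mappings, the sketch below treats
f as a derivative indexing of titles and obtains g by inverting it.
The accession codes, titles, and stoplist are invented for the
example, and extracting every non-stoplisted title word is only one
possible choice of indexing function.

```python
# Invented sample collection: accession code -> title.
DOCUMENTS = {
    "D1": "Automatic indexing of titles",
    "D2": "Retrieval of chemical titles",
    "D3": "Indexing for retrieval",
}
STOPLIST = {"a", "an", "and", "for", "in", "of", "the"}


def f(accession_code):
    """Indexing function: a document -> its set of index descriptors."""
    words = DOCUMENTS[accession_code].lower().split()
    return {word for word in words if word not in STOPLIST}


def g(index_entry):
    """Retrieving function: an index entry -> the set of documents it locates."""
    return {code for code in DOCUMENTS if index_entry in f(code)}


print(sorted(f("D1")))      # the index descriptors of D1
print(len(f("D1")))         # n<j>, the breadth of indexing for D1
print(sorted(g("titles")))  # the document descriptors of "titles"
print(len(g("titles")))     # m<k>, the number of documents located
```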

Before the functional characteristics of indexing and
retrieving are examined more thoroughly, let
us characterize some of the properties of the sets of
documents and index entries.

When the elements of the index are just single words or
short descriptive phrases accompanying the accession code,
then the index is related to a Uniterm index as developed by
Taube {Taube,61}. If these single terms can be reduced in
scope by the application of one or more levels of subterms,
then the index is called a coordinate index after Johnson
{Johnson,59}.

If the index entries describing document concepts are
condensed into words or phrases possibly not found in the
document itself but considered to be likely and useful index
terms, then the function of indexing is called assigned.
The term derivative indexing is used to describe the
indexing function when the index entries are extracted from
the title or body of the document.

Many indexes are restricted to a fixed vocabulary. The
index terms forming the set I are predetermined, requiring
that the indexing function, f, always generate index
descriptors within this set for each new document added to
the collection. Consequently, assigned indexing techniques
are generally required for fixed vocabulary indexes. In
this restrictive sense, a fixed vocabulary index is usually
accompanied by an authority list which directs the
retrieving function to a preferred index entry for other
concepts not found in the index itself. The authority list
may be included in the index space itself in the form of
"see" cross references which list the corresponding
preferred index entry as an indirect reference.
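
A minimal sketch of how an authority list redirects retrieval in a
fixed vocabulary index follows. The terms, accession codes, and
function names are all invented for illustration and do not come from
the programs documented in this thesis.

```python
# Invented authority list: non-preferred term -> preferred index entry.
AUTHORITY_LIST = {"cars": "automobiles", "autos": "automobiles"}

# Invented fixed vocabulary index: preferred index entry -> accession codes.
FIXED_INDEX = {"automobiles": ["D4", "D9"], "engines": ["D2"]}


def retrieve(term):
    """Follow the authority list to the preferred entry, then look it up."""
    preferred = AUTHORITY_LIST.get(term, term)
    return preferred, FIXED_INDEX.get(preferred, [])


def see_references():
    """Render the authority list as "see" cross references for the printed index."""
    return sorted(f"{term} see {preferred}" for term, preferred in AUTHORITY_LIST.items())


print(retrieve("cars"))
print(see_references())
```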

When a free vocabulary is used to create index
descriptors for documents entering the collection, each
application of the indexing function is independent of any
other indexing operation. The addition of documents to a
collection can cause an increase in the number of index
terms found in the index. Derivative indexes commonly use
this technique. As a result of the freedom reflected in the
indexing function and the redundancy of natural language, a
particular concept may appear in many places in the index.
Even the same word used to describe a concept may appear in
various inflectional forms. A useful restriction of the
vocabulary freedom replaces inflectional variations of words
with a common preferred form.

Let us now turn our attention to the indexing and 
retrieving functions. Some useful results can be gleaned 
from their functional relationships if first a null 
operation is defined. 

Let PHI<I> and PHI<D> represent the null index entry
and the null document, respectively. Define

     f(PHI<D>) = PHI<I>
     g(PHI<I>) = PHI<D>

Then the operations of union and intersection can be
defined. (The operations will be carried out using the
retrieving function only; however, the results hold for the
indexing function as well.)

     g(i<k> UNION i<j>) = g(i<k>) UNION g(i<j>)

     g(i<k> INTERSECT i<j>) = PHI<D>     for k # j
                            = g(i<k>)    for k = j

Since the index entries are assumed unique, the operation of
intersection is non-null within the index space only when
the index entries are identical.

These two operations lead to the foundations of
document retrieval through the retrieving function. If X
and Y are subsets of index entries, then

     g(X UNION Y) = g(X) UNION g(Y)
     g(X INTERSECT Y) is contained in g(X) INTERSECT g(Y)
          where X, Y are contained in I

The document descriptor formed by the union of two sets of
index terms follows trivially. However, intersection in the
index space is not equivalent to intersection in the
document space. Without loss of generality, let us assume
that the elements of X and Y can be separated into three
distinct subsets A, B, and C such that

     X = A UNION B
     Y = A UNION C
     B INTERSECT C = A INTERSECT B = A INTERSECT C
                   = PHI<I>



then

     g(X INTERSECT Y)
          = g((A UNION B) INTERSECT (A UNION C))
          = g(A UNION (B INTERSECT A) UNION
                      (B INTERSECT C) UNION
                      (A INTERSECT C))
          = g(A) UNION PHI<D> UNION PHI<D> UNION PHI<D>
          = g(A)

however,

     g(X) INTERSECT g(Y)
          = g(A UNION B) INTERSECT g(A UNION C)
          = (g(A) UNION g(B)) INTERSECT (g(A) UNION g(C))
          = g(A) UNION (g(B) INTERSECT g(A))
                 UNION (g(B) INTERSECT g(C))
                 UNION (g(A) INTERSECT g(C))

but since g(A) contains g(B) INTERSECT g(A) and
          g(A) contains g(A) INTERSECT g(C),

then g(X) INTERSECT g(Y)
          = g(A) UNION (g(B) INTERSECT g(C))

and g(B) INTERSECT g(C) may be non-null.

Consequently

     g(X) INTERSECT g(Y)
          = g(X INTERSECT Y) UNION (g(B) INTERSECT g(C))
          = g(X INTERSECT Y) UNION
                 (g(X INTERSECT ¬Y) INTERSECT
                  g(¬X INTERSECT Y))

The relationships above depict the common actions
performed by a retriever using an index. The union of index
entries retrieves documents containing any of the concepts
described by the entries. Because of the uniqueness of
index entries, the intersection of concepts is carried out
in the document space instead of the index space. When the
subsets X and Y are mutually exclusive, as is the usual
case, the desired retrieval can only be performed in the
document space.
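
As an illustration only, the relationships above can be checked
on a toy index. The following Python sketch is not part of the
original development: the index entries, the accession codes,
and the name INDEX are assumptions chosen for the example. The
empty set plays the role of the null document PHI<D>.

```python
# Hypothetical index: each unique entry maps to the accession codes posted under it.
INDEX = {
    "indexing": {1, 2},      # plays the role of subset A
    "kwic": {3, 5},          # plays the role of subset B
    "retrieval": {2, 3, 4},  # plays the role of subset C
}

def g(entries):
    """Retrieving function applied to a set of index entries."""
    docs = set()
    for entry in entries:
        docs |= INDEX.get(entry, set())
    return docs

X = {"indexing", "kwic"}       # X = A UNION B
Y = {"indexing", "retrieval"}  # Y = A UNION C

union_in_index = g(X | Y)    # union taken in the index space
union_in_docs = g(X) | g(Y)  # union taken in the document space
inter_in_index = g(X & Y)    # reduces to g(A)
inter_in_docs = g(X) & g(Y)  # g(A) UNION (g(B) INTERSECT g(C))

print(union_in_index == union_in_docs)  # the two unions agree
print(inter_in_index <= inter_in_docs)  # containment only, not equality
print(inter_in_docs - inter_in_index)   # documents found only in the document space
```

In this example document 3 is posted under both "kwic" and
"retrieval", so it is reached only when the intersection is
carried out in the document space.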

When an index has been adequately prepared, the
retrieval function is represented by a mechanical procedure
of tracing the location of the documents via the accession
codes contained in the index entry. The failure of an
index to retrieve pertinent documents accurately is not a
reflection of the mechanical retrieving function but a
consequence of a poorly constructed index descriptor by the
indexing function.

Real indexing functions suffer from two general types
of errors:

     1) attribute only a subset of the concepts found in a
        document to the document,

     2) attribute to a document a set of concepts not
        present in the document.

These errors may be examined formally by introducing a
perfect indexing function, f'. Let

     f(d) = A UNION B     for all d in D

where

     A = {index entries describing concepts in d
          attributed to d}

     B = {index entries describing concepts not in d
          attributed to d}

The perfect indexing function, f', would generate an index
descriptor of the form:

     f'(d) = A UNION C    for all d in D

where A is defined above and

     C = {index entries describing concepts in d
          not attributed to d by f}

Let |X| represent the number of elements in the set X.
Then, the real index generated by applying f to the entire
document collection can be represented as:

     I = (i=1,|D|) UNION f(d<i>)
       = (i=1,|D|) UNION (A<i> UNION B<i>)

Should any intersections of the sets A and B be non-empty,
irrelevant documents will be retrieved when the retrieving
function is applied to any member of that set. That is, if

     g(A<i> INTERSECT B<j>) # PHI<D>
          for some i, j in {1,2,...,|D|}

then some irrelevant documents will be retrieved regardless
of the perfection of g. If

     (i=1,|D|) UNION B<i> is contained in
     (i=1,|D|) UNION A<i>

then every retrieval will at least produce one relevant
document. The only method of decreasing the number of
irrelevant documents retrieved lies in reducing the set B of
improperly attributed document concepts - a refinement of
the indexing function.

Applying the perfect indexing function to the document
collection, a superset of the real index is built:

     I is contained in
          (i=1,|D|) UNION f'(d<i>)
        = (i=1,|D|) UNION (A<i> UNION C<i>)

A non-empty intersection of the sets A and C leads to the
possibility of not retrieving all the relevant documents
pertaining to a concept described by an index entry.
Consequently, if

     g(A<i> INTERSECT C<j>) # PHI<D>
          for some i, j in {1,2,...,|D|}

then a retrieval error occurs regardless of the perfection
of g. This type of error is masked from the user since it
reflects relevant documents not retrieved.

These abstract set notations can be transformed to the
more familiar measures of retrieval effectiveness, recall
and precision.

               number of relevant documents retrieved
     Recall = -------------------------------------------
               number of relevant documents in data base

                  number of relevant documents retrieved
     Precision = ----------------------------------------
                      number of documents retrieved

Let X in I represent a function of index terms.

     Let x = X INTERSECT A
         y = X INTERSECT B
         z = X INTERSECT C

Then the documents retrieved from the real index are

     g(X) = g(x UNION y) = g(x) UNION g(y)

while those retrieved from the ideal index are

     g(X) = g(x UNION z) = g(x) UNION g(z)

then

     Recall = |g(x)| / |g(x) UNION g(z)|
            = |g(x)| / (|g(x)| + |g(z)| - |g(x) INTERSECT g(z)|)

     Precision = |g(x)| / |g(x) UNION g(y)|
               = |g(x)| / (|g(x)| + |g(y)| - |g(x) INTERSECT g(y)|)

Note that recall and precision are inversely related to the
inaccuracy of the indexing function.
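
The two ratios can be evaluated directly from sets of accession
codes. The sketch below is a hedged illustration in Python; the
three document sets are invented for the example, and the
function names are not drawn from the text.

```python
def recall(g_x, g_z):
    """|g(x)| / |g(x) UNION g(z)|: share of the ideal retrieval actually obtained."""
    return len(g_x) / len(g_x | g_z)

def precision(g_x, g_y):
    """|g(x)| / |g(x) UNION g(y)|: share of the real retrieval that is relevant."""
    return len(g_x) / len(g_x | g_y)

def union_size(a, b):
    """Size of a union computed by inclusion-exclusion, as in the expanded forms."""
    return len(a) + len(b) - len(a & b)

# Hypothetical accession codes
g_x = {1, 2, 3}  # retrieved through correctly attributed entries
g_y = {3, 7}     # retrieved through improperly attributed entries (set B)
g_z = {2, 4, 5}  # reachable only through entries the real index lacks (set C)

print(recall(g_x, g_z))     # 3 of 5 documents in the ideal retrieval
print(precision(g_x, g_y))  # 3 of 4 documents in the real retrieval
```

Enlarging g_y lowers precision and enlarging g_z lowers recall,
which is the inverse relationship noted above.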

The reader should be convinced by these last arguments
that the failures found in real document retrieval systems
are not in the retrieval network per se. This can be
reduced to a mechanical procedure of performing
transformations on accession codes. At best, the retrieval
network performs in a fashion proportional to the perfection
of the index on which it is based. Consequently, the goal
of this thesis is to provide an automatic indexing technique
to produce higher quality indexes.



CHAPTER III. AUTOMATED INDEXING: A BRIEF HISTORY

The application of derivative techniques to documents
predates electronic machines by centuries. Several orders
of monks during the 12th and 13th centuries manually
prepared concordances {Simmons,63}, listings of each word
with all the contexts in which it appeared in a document.
Concordance construction is an index producing operation, an
indexing function that preserves the contents of the full
document. However, such exhaustive concordances are
incredibly time consuming, tedious, and error prone tasks
when carried out manually. A suggestion was proposed as
early as 1856 to use concordance techniques to generate an
index from titles of document collections {Simmons,63}, but
the necessary manual preparation time caused the idea to be
dropped.

The advent of general purpose electronic computers
promised non-numeric processes which could represent,
preserve, manipulate, and print textual data at
unprecedented speeds. Because the computer could faithfully
reproduce the textual transformations, most of the previous
deficiencies and clerical labor of the manual production of
concordance-like indexes could be reduced to preparing a
corpus of documents in machine readable form. Even more
radical possibilities for the potential use of computers were
envisioned by many of the pioneers of the time. The salient
features of some of these systems of indexing will be
discussed in this chapter. These methods can be generally
classified by the processing operations automatically
applied to the text of a document. A computer-compiled
index is merely an ordering of permutations of preselected
items (index entries) presented for input. The index terms
of even the most elementary form of a computer-generated
index have been extracted from the input text by some
automated selective procedure. In either case, the ordering
and duplicating of index terms, the compilation and
presentation of accession codes, and the formatting and
printing of the index are computer controlled. The amount
of intellectual effort required to augment the automatic
process is an attribute of the particular system and is not
amenable to general classification.

3.1. Computer-Compiled Indexes

One of the first and obvious applications of computers
to index construction was the manipulation of index entries
previously selected by human analysis. The power of a
computerized technique of duplicating and sorting index
entries could provide various orderings and listings of
terms for special purpose indexes. For example, from the
same machine readable data base, a Uniterm index could be
prepared as well as an author index. These by-products of
machine-readable indexes were recognized as being as
important as the index itself {Olney,63}. Not only could
duplicate copies of the index as a whole be prepared, but
the basis for elementary automated retrieval systems was
also present. From a single machine-readable Uniterm index,
a specified subset of the index entries could be listed as a
special purpose index, or, with a slightly more
sophisticated program, listings of documents having more
than one common index entry could be prepared.

Completely new types of indexes, previously considered
unmanageable because of the required tedious manual labor,
could be considered. Recall that the indexing function maps
documents into index descriptors. When a Uniterm index is
constructed, each entry is a subject heading (Uniterm)
consisting of a single keyword, or several keywords (or a
code representing these keywords) and the document accession
code. A new index term can be constructed from the
concatenation of the terms of the index descriptor. That
is, if

     f(d) = {i<1>,i<2>,...,i<n>}

where i<j> represents a Uniterm, then the new term i' is

     i' = i<1>i<2>...i<n>

This new term provides much more information to the user
since all the descriptors ascribed to the document are
present. Indeed, the depth of the index term is increased,
but, if this were the only entry under which the document
may be found, and the ordering of the index is alphabetical
by entry, a user will be led to document d only through the
term i', a definite decrease in breadth. For example,
titles of new books produced by some publishing houses {see
McGraw-Hill,72} are ordered in lists by the first word of
the title. A title found in these lists closely models an
index term consisting of the concatenation of descriptors
when each significant title word is considered to be a
descriptor. A solution to the problem of accession to only
the first word of the list would be to construct a rotated
keyword index, discussed in the next section.

3.1.1. Rotated Keyword Index

In a rotated keyword index, an index term is
constructed beginning with each uniterm followed by the
remaining uniterms assigned to the document as if the terms
were formed by successive Uniterm rotations. For example,
if

     f(d) = {a,b,c}

then

     i'<1> = abc
     i'<2> = bca
     i'<3> = cab

Rotated keyword indexes retain the same breadth while
increasing the depth of Uniterm indexes. Skolnik has
demonstrated the usefulness of a rotated keyword index which
he calls the MULTITERM index {Skolnik,70}. When the entries
are ordered alphabetically, documents having at least one
Uniterm in common are listed together. If two documents
share more than one uniterm, they may be separated in the
index by an arbitrary number of unrelated entries which
depends upon the order in which the uniterms were
concatenated to form the initial index entry. The random
distribution of index entries sharing more than one Uniterm
reduces the effectiveness of rotated keyword indexes for
performing coordinate searches.
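
The rotation step can be sketched in a few lines of Python. This
is an illustration of the cyclic rotation described above, not a
reconstruction of any published program; the function name
rotated_entries is an assumption.

```python
def rotated_entries(uniterms):
    """Return one index entry per cyclic rotation of the assigned uniterms."""
    terms = list(uniterms)
    return [terms[i:] + terms[:i] for i in range(len(terms))]

# f(d) = {a,b,c} yields one entry beginning with each uniterm.
entries = ["".join(entry) for entry in rotated_entries(["a", "b", "c"])]
print(entries)  # ['abc', 'bca', 'cab']
```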

3.1.2. Completely Permuted Keyword Index

All index terms having an arbitrary number of uniterms
in common are collected in a single place in a completely
permuted keyword index. Instead of forming the cyclic
rotations of the uniterms, the indexing function produces
all permutations of the uniterms as index entries.
Thus, if

     f(d) = {a,b,c},

then

     i'<1> = abc
     i'<2> = acb
     i'<3> = bac
     i'<4> = bca
     i'<5> = cab
     i'<6> = cba

Coordinate searches require only one entrance into the index
beginning with the entry associated with any combination of
uniterms of interest.

Completely permuted keyword indexes suffer a size
problem and, because of this, no concrete example can be
cited. If an indexing function produces, on the average, n
uniterms per document, then a rotated keyword index contains
m * n index entries (m = number of documents in the
collection) while a completely permuted keyword index would
contain m * n! entries. Ten keywords is not an uncommon
number to be assigned to a document. For a collection of
one hundred thousand documents, 1,000,000 entries would be
included in a rotated keyword index, but each document
assigned 10 keywords would be entered 3,628,800 times in a
completely permuted keyword index! Although a computer may
not be disturbed by the size of such an index, the user may
(as would the producer paying for its creation).
Consequently, other means for achieving coordinate searches
were considered.
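
The size argument is easy to verify numerically. The Python
sketch below uses the standard library's permutations and
factorial; the helper names are assumptions introduced for the
illustration.

```python
from itertools import permutations
from math import factorial

def permuted_entries(uniterms):
    """All orderings of the assigned uniterms, one index entry per ordering."""
    return ["".join(p) for p in permutations(uniterms)]

def rotated_index_size(m, n):
    """Entries in a rotated keyword index: m documents, n uniterms each."""
    return m * n

def permuted_index_size(m, n):
    """Entries in a completely permuted keyword index."""
    return m * factorial(n)

print(permuted_entries("abc"))               # ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
print(rotated_index_size(100_000, 10))       # 1000000
print(factorial(10))                         # 3628800 entries for a single document
print(permuted_index_size(100_000, 10))      # 362880000000
```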

3.1.3. Selected Listing In Combination (SLIC) Index

Undoubtedly, a completely permuted index provides for
document retrieval through any ordering of terms assigned to
a document, but as Sharp {Sharp,65} has pointed out, "this
multiplicity of entries is not only extravagant but quite
unnecessary." The requirement of a coordinating system is
to provide the searcher with all combinations of terms
pertinent to both the searcher and document concerned. All
combinations (in the mathematical sense) of index terms
together with a canonical scheme for representing them
suffice as useful coordinate entries for indexing.

To consider the indexing function, let n be the number
of uniterms assigned to a document. The index should
include every combination of the n assigned terms taken 1
at a time, every combination taken 2 at a time, and so on
through the single combination taken n at a time. The size
problem found in a completely permuted index is considerably
reduced since the total number of terms can be expressed as

     {i=1,n} SUM (c<n,i>)
          = 2**n - 1   (note for n>3 this is less than n!)

Each combination of terms generated must be unique for the
retrieval function to operate successfully; consequently,
some ordering relationship must be applied to each
combination. The obvious order for an index using natural
language terms is alphabetical. Assuming an indexer has
assigned the terms a, b, c, and d to a document and a canonic
alphabetical ordering is observed, then the index terms
generated are as follows:



      1  a          5  ab         11  abc
      2  b          6  ac         12  abd
      3  c          7  ad         13  acd
      4  d          8  bc         14  bcd
                    9  bd         15  abcd
                   10  cd

If a searcher were interested in a document containing any
two of the descriptors above, say a and c, he would be led,
as in a permutation index, to this document even though it
contained two additional descriptors. Sharp {Sharp,66}
observes that a user searching for attributes ac would be
satisfied by the term acd or "any entry consisting of or
beginning with the sought terms". The entry ac is
superfluous as are any entries contained in any larger
entry; consequently, a further reduction of index entries
can be permitted. Terms 1, 2, 3, 5, 6, 8, and 11 can be
eliminated leaving:

      1  d          5  abd
      2  ad         6  acd
      3  bd         7  bcd
      4  cd         8  abcd

These remaining entries still provide all coordinate entries.
Since the indexing function generates a subset of all
combinations of index terms, Sharp has dubbed this method
Selected Listing In Combination (SLIC) as shown in Figure 3.1.

Figure 3.1 A portion of a SLIC index

It is interesting to note that the only terms remaining
are those combinations which end with the last descriptor of
the assigned sequence. This simplifies the calculation of
the total number of index entries to be entered in the
index. If the final term (d in the example above) is
dropped from each index entry, what remains is the sum of
all combinations of n-1 items taken 0 through n-1 times, or

     {i=0,n-1} SUM (c<n-1,i>) = 2**(n-1)

Algorithms for generating SLIC indexes have been given by
Sharp {Sharp,66} and by Rush and Russo {Rush,71}.

SLIC techniques reduce the size of permuted indexes and
retain coordinating ability yet still suffer from a
multiplicity of entries when the number of assigned terms is
large. The SLIC method produces 512 entries for a document
assigned 10 terms: too many for some real applications.
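
The reduction can be reproduced with a short Python sketch. It
is a naive illustration of the selection rule stated above, not
the algorithm of Sharp or of Rush and Russo: every alphabetized
combination is generated, and any entry that begins a larger
entry is then discarded. The function names are assumptions.

```python
from itertools import combinations

def all_combinations(terms):
    """Every combination of the terms, each kept in canonical (sorted) order."""
    ordered = sorted(terms)
    return [c for r in range(1, len(ordered) + 1)
            for c in combinations(ordered, r)]

def slic_entries(terms):
    """Keep only combinations that are not the leading part of a larger one."""
    combos = all_combinations(terms)
    kept = []
    for c in combos:
        begins_larger = any(len(o) > len(c) and o[:len(c)] == c for o in combos)
        if not begins_larger:
            kept.append(c)
    return kept

full = ["".join(c) for c in all_combinations("abcd")]
slic = ["".join(c) for c in slic_entries("abcd")]
print(len(full), full)  # 15 entries, numbered as in the table above
print(len(slic), slic)  # 8 entries, each ending with d
```

Applied to ten assigned terms, the same procedure leaves
2**9 = 512 entries, in agreement with the count given above.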

3.1.4. PERMUTERM Index

Garfield {Garfield,55} has described an indexing
function which compromises some coordinating ability for
space. Uniterms assigned to a document form two distinct
classes: main terms, which constitute the primary access
points to the document; subordinate terms, modifying words
which specify more clearly the sense in which a main term is
used. For each main term, an index entry is constructed for
each of the remaining uniterms assigned to the document as a
coordinate main-subordinate entry.

Assuming that the uniterms a and b are main terms of a
document assigned concepts a, b, c, and d, then the index
entries so generated are:

     ab
     ac
     ad
     ba
     bc
     bd

A PERMUTERM index collects in one place all subordinate
entries, alphabetically ordered, pertaining to each main
term found in a document collection. The indexing function
approximates a subset of a completely permuted index (see
section 3.1.2) whose entries are the permutations of all
terms taken two at a time. Of n terms assigned to a
document, assume m, 1<=m<=n, form the subset of main terms.
The number of entries generated for this document is:

     m * (n - 1) = k * n * (n - 1),   for 1/n <= k <= 1

When k maintains its average over its uniform interval of
definition, then the number of entries generated per
document is

     (n**2 - 1)/2
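
A brief Python sketch, offered as an illustration under the
stated assumptions, generates the main-subordinate pairs and
checks the average. The function names are hypothetical, and the
average is taken over m = 1, ..., n with equal weight, which
corresponds to k held at the midpoint of its interval.

```python
def permuterm_entries(terms, main_terms):
    """Pair each main term with every other term assigned to the document."""
    return [(main, other) for main in main_terms
            for other in terms if other != main]

def average_entries(n):
    """Mean of m * (n - 1) when m is equally likely to be 1, 2, ..., n."""
    return sum(m * (n - 1) for m in range(1, n + 1)) / n

pairs = ["".join(p) for p in permuterm_entries("abcd", "ab")]
print(pairs)                # ['ab', 'ac', 'ad', 'ba', 'bc', 'bd']
print(average_entries(10))  # 49.5, i.e. (10**2 - 1) / 2
```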

As employed by Garfield at the Institute for Scientific
Information, the PERMUTERM index could be classified as a
computer-generated index, discussed in more detail in the
next section. Documents are assigned keywords extracted
from machine readable natural language titles. Single word
concepts as well as frequently encountered word pairs
matched from pre-compiled tables may be selected as main
terms. Subordinate terms are automatically determined from
a list of commonly applied modifiers. Figure 3.2 displays
an example of a PERMUTERM index derived from the document
descriptors of Figure 3.1.

Figure 3.2 A portion of a PERMUTERM index



3.2. Computer-Generated Indexes

The preceding section has dealt with useful, automated
means of displaying index terms once they have been
associated with a document. This section examines the more
fundamental question of automatically selecting index terms
from documents of natural language text.

Since derivative indexing techniques employ extractions
from the document, the index descriptors must exist as some
unit of the document itself. The most natural units of
textual data are words or collections of words which form
the objective index terms.

The underlying question which separates the techniques
to be described is which words or phrases are to be chosen
as representatives of the document and placed in the index.
Of course one could easily argue that the ideal
representative of a document, thus its ideal index entry, is
the document itself. The indexing function in this case
would do nothing but rearrange the units of the document and
pass them to the index. The size of the index would be the
sum of the sizes of the documents of the collection. The
usefulness of such an index is doubtful since all units
found in each document would be present in the index
regardless of their importance to the subject matter
discussed. Therefore, without some means of selectively
choosing extractions from documents, computer-generated
indexes would be of little value.

The selection of words or phrases naturally divides the
index units of a document into two classes: those to be
included in the index descriptor and those that are
inappropriate as document representatives. Several ordering
relations are commonly applied to include or exclude units
from these sets. A word could be chosen because of its form
or position in a document - e.g. it may be included as an
index entry if the word is capitalized and does not begin a
sentence. The words themselves may be used as a clue -
e.g. a word is indexable if it isn't non-indexable (this
stoplist technique of admitting index entries will be
discussed in section 3.2.1). Or, the statistical nature of
the document can describe its own descriptors - e.g. the ten
most frequently found non-common words of the document can
be chosen.
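
Two of these selection rules, the stoplist and the frequency
count, can be sketched in Python as follows. The stoplist
contents, the sample sentence, and the function names are
assumptions made for the illustration; a working stoplist would
be far longer.

```python
from collections import Counter

# Hypothetical stoplist of non-indexable words.
STOPLIST = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def stoplist_keywords(text, stoplist=STOPLIST):
    """A word is indexable if it is not on the stoplist."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return [w for w in words if w and w not in stoplist]

def most_frequent_keywords(text, k=10):
    """The k most frequently found non-common words of the text."""
    counts = Counter(stoplist_keywords(text))
    return [word for word, _ in counts.most_common(k)]

text = "The index of a document is the index for the retrieval of a document."
print(stoplist_keywords(text))          # ['index', 'document', 'index', 'retrieval', 'document']
print(most_frequent_keywords(text, 2))  # ['index', 'document']
```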

3.2.1. Key-Word-in-Context (KWIC) Index and Key-Word-Out-of-Context (KWOC) Index

In striving for a speedy, totally automated method of
index construction, H. P. Luhn reasoned that the
organization of index entries must rely on terms extracted
from an author's text rather than assigned in accordance
with human judgement {Luhn,59}. The simplest form of such
an index might be an alphabetic listing of keywords found in
a document; however, to ensure the proper meaning of such
keywords, the user would have to refer to the text from
which the word was extracted. To alleviate this tedious
procedure, Luhn proposed listing selected "keywords together
with surrounding words acting as modifiers to specify the
sense in which the keyword was applied". The added degree
of keyword specification by such key-word-in-context, KWIC,
indexes is easily accomplished by automatic means.

The keywords of a document need only be defined as
those words which characterize a subject more than others.
Since word significance is often difficult to precisely
define, it becomes more practical to reject all obviously
non-significant words, retaining any others as significant
with the risk of admitting words of questionable status. A
list of these non-significant words, called a stoplist,
would include prepositions, conjunctions, articles, auxiliary
verbs, certain adjectives, and words of little informative
value such as "report", "theory", and the like.

Computer-generated KWIC indexes have become an
important tool in the maintenance of truly current awareness
because of the speed and simplicity of the indexing method.
The text of an author's title, a sentence from an abstract,
or full text is submitted in machine readable form. Each
word of the text is processed against the stoplist
eliminating words found therein from further processing.
The remaining presumably significant words are rotated, one
at a time in succession, to an indexing position or keyword
window where a snapshot of the keyword and its surrounding
context is recorded. This process is repeated until all the
text of the collection has been submitted. The recorded
images are then alphabetically arranged by the keyword
appearing in the indexing position and listed with as much
surrounding context as will fit within a column on the
printed output page.
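
The procedure just described can be sketched in Python. This is
a simplified illustration, not Luhn's program: the stoplist, the
column widths, the sample titles, and the function names are all
assumptions, and context that overflows a column is truncated
here rather than wrapped around to the other side of the line.

```python
# Hypothetical stoplist; a production list would be far longer.
STOPLIST = {"a", "an", "and", "at", "in", "of", "the"}

def kwic_entries(title, stoplist=STOPLIST):
    """Return (keyword, left context, right context) for each significant word."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word.strip(".,").lower() in stoplist:
            continue  # non-significant word: no entry is generated
        entries.append((word, " ".join(words[:i]), " ".join(words[i + 1:])))
    return entries

def kwic_lines(titles, left_width=24, right_width=36):
    """Sort entries by keyword, then by following context, and format each line."""
    entries = [e for title in titles for e in kwic_entries(title)]
    entries.sort(key=lambda e: (e[0].upper(), e[2].upper()))
    lines = []
    for keyword, left, right in entries:
        window = (keyword + " " + right).strip()[:right_width]
        # The keyword always starts in the same column (the keyword window).
        lines.append(left[-left_width:].rjust(left_width) + " " + window)
    return lines

# Illustrative titles only.
titles = [
    "FOAM FRACTIONATION OF POLYMERS",
    "CONDUCTIVITY OF COPPER FOIL AT LOW TEMPERATURES",
]
for line in kwic_lines(titles):
    print(line)
```

Sorting on the following context as a secondary key is what
brings multi-word phrases beginning with the same keyword
together in the listing.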

Since its introduction by Luhn {Luhn,59} and Citron
{Citron,58}, the KWIC index has taken on many display
formats, each claiming to have certain advantages. The most
common, shown in Figure 3.3, displays on a single line a
centrally located keyword with the surrounding context
"wrapped around" to present the user with as much of the
modifying phrase as possible regardless of the location of
the keyword in the sentence. This format leads the user
directly to the keyword window allowing him the freedom to
browse in the modifying context upon locating a keyword of
interest. When the context following the keyword is used to
further order index entries, multi-word phrases beginning
with the same keyword are clumped together providing limited
search capabilities for more specific concepts. However, all
valid coordinations of words producing this multi-word
concept are not necessarily located at that position in the
index since the secondary descriptor may be located at some
point other than the word immediately following the keyword.
Thus, to locate those scattered, more specific concepts, the
entire text of all titles containing this keyword must be
scanned to spot all occurrences of secondary descriptors.
When a significant word appears frequently in the indexed
text, this format may discourage users from the sequential
scanning of long blocks of identical keywords.

Y CYLINDERS AT LIQUID * FLUX JUMPS IN NIOBIUM-ZIRCONIUM ALLO
LE FOR A LUBRICANT IN A FLUX OF SULFURIZING GASES. = *SUITAB
                        FOAM FRACTIONATION OF POLYMERS. =
ISO IN NUCLEAR HARTREE- FOCK ORBITALS AND ELASTIC AND QUASIF
 . RELATION TO HARTREE- FOCK THEORY. = *ATOMIC POLARIZABILIT
CGE ELECTRON P+USE OF A FOCUSSING SPECTROMETER WITH A CAMBRI
ND DIALYZED EXTRACTS OF FODDER AND BAKER'S YEASTS. = *A
APHYLOCOCCAL NUCLEASE ( FOGGI STRAIN). ORDER OF CYANOGEN
 CONDUCTIVITY OF COPPER FOIL AT LOW TEMPERATURES. = *ON THE
BI CRYSTALLINE ALUMINUM FOIL.=+ MIGRATION PHENOMENA IN THIN
ECIUM. PURIFICA+DIHYDRO FOLATE REDUCTASE OF STREPTOCOCCUS FA
OPERTIES OF TWO DIHYDRO FOLATE REDUCTASES FROM THE AMETHO
TION PRODUCT OF DIHYDRO FOLATE. = +IDENTIFICATION AS A DEGRADA
THE RHIZOSPHERE EFFECT. FOLIAR APPLICATION OF CERTAIN CHEMIC

Figure 3.3 A portion of a KWIC index

Many users of these indexes were unsatisfied with the
KWIC format, having been accustomed to the more traditional
forms of subject indexes. To satisfy these users, a
variation of the KWIC indexing method generates subject
headings by extracting the keyword from the context forming
a keyword-out-of-context (KWOC) index as shown in Figure
3.4. In this figure each KWOC index entry retains the
entire text of the title or phrase from which the keyword
was extracted. Other variations may include only a portion
of the title or phrase from which the keyword was extracted.
Coordinate searches are difficult to perform in these
indexes since no subordering scheme is employed to collate
secondary concepts. Thus, the user is forced to linearly
scan each title phrase posted beneath the extracted term for
secondary concepts of interest.
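
A KWOC listing can be sketched in the same spirit. In the Python
illustration below, each significant word becomes a heading and
the full title is posted beneath it with the keyword replaced by
an asterisk. The titles, the stoplist, and the function name are
hypothetical.

```python
def kwoc_index(titles, stoplist):
    """Map each extracted keyword to the titles posted beneath it."""
    index = {}
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            heading = word.strip(".,").upper()
            if not heading or heading.lower() in stoplist:
                continue
            # Post the whole title, masking the extracted keyword with '*'.
            posted = " ".join("*" if j == i else w for j, w in enumerate(words))
            index.setdefault(heading, []).append(posted)
    return dict(sorted(index.items()))

titles = ["FOAM FRACTIONATION OF POLYMERS", "POLYMERS IN FOAM"]
index = kwoc_index(titles, stoplist={"of", "in"})
for heading, postings in index.items():
    print(heading)
    for posted in postings:
        print("    " + posted)
```

Because the postings beneath a heading keep their order of
arrival, a searcher looking for a second concept must read each
one in turn, which is the linear scan described above.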

The single, flexible determinant of the quality and the
size of a KWIC index lies in the words found on the stoplist.

ISOLATION OF * AND RIBOSOMAL RNA FROM RAT LIVER.=         2afi
* BASIC COMPOSITION OF HUMAN T-STRAIN MYCOPLASMS.=
PHOTOPRODUCTS IN * IRRADIATED IN VIVO.=                   643
TURNOVER OF NUCLEAR * LIKE RNA IN HELA CELLS.=            112
EFFECTS OF METALS ON THE MECHANISM OF ACTIVATED *
     NUCLEASES.=
USE OF A NEW METHOD TO OBSERVE THE KINETIC REACTIONS
     OF * NUCLEASES.=                                     '»01
DENATURATION MAP OF POLYOMA VIRUS * .=                    2H2
ELECTRIC CONDUCTIVITY OF SODIUM SALTS OF * .=             648
EFFECT OF SOME MUTAGENIC VIHOgIeNS AND CARCINOGENS ON
     * .=                                                 131

DOCOSAHEXAENOIC
PREDICTING THE POSITIONAL DISTRIBUTION OF * AND
     DOCOSAPENTAENOIC ACIDS IN ANIMAL TRIGLYCERIDES       417

Figure 3.4 A portion of a KWOC index

Short lists, rejecting only the most obvious insignificant
words, admit many index terms of doubtful value and
needlessly increase the size of the final index. The
general subject matter of a corpus of documents dictates, to
a great degree, word usage. The vocabulary of chemistry
differs greatly from that of mathematics. Stoplists
constructed for preparing indexes of document collections
from these fields could be expected to be similar only at
the most common word level comprising conjunctions,
articles, and a few adjectives. Words which could be highly
relevant to one subject area may be so common or
uninformative to another field as to appear on the index
construction stoplist of the latter. For example, the word
"field" carries a strict definition within mathematical
disciplines, while in agriculture, the word has little
significance. Placing a word on the stoplist which could
generate many index entries is a common practice which
reduces "block fatigue" and size on the one hand, but
totally denies user access through this word on the other!
The economic balance of the number of lines to be printed
against the loss of retrieval effectiveness if words are
omitted from the search is the critical question that must
be decided in the establishment of stoplists.

To estimate the size of a KWIC index, the relative
number of non-significant words must be estimated as well as
the average number of words per document. If p is the
fraction of significant words of an n-word document, then the
breadth of indexing is p * n. Most KWIC indexes and some
KWOC indexes require one line per entry; thus, the number of
lines in an index of m documents is m * p * n. The size of
a KWOC index is approximately the same as that produced by
KWIC indexing methods when the title is printed on a single
line. When the full title phrase is presented in the index,
the size estimates become more data base dependent.
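
The size estimate above can be checked with a few lines of
Python. This is only an illustrative sketch: the function
name and the sample values (1000 titles of 10 words, half of
them significant) are assumptions chosen for the example and
do not come from the collections discussed in this thesis.

```python
def kwic_index_lines(m, p, n):
    """Estimated number of one-line index entries.

    m -- number of documents (titles) in the collection
    p -- fraction of the words of a title that are significant
    n -- average number of words per title
    The breadth of indexing per title is p * n.
    """
    return m * p * n


# Hypothetical collection: 1000 titles, 10 words each, half significant.
lines = kwic_index_lines(m=1000, p=0.5, n=10)
print(lines)  # 5000.0
```

Lengthening the stoplist lowers p and therefore shrinks the
index proportionally, which is the trade-off against
retrieval effectiveness described earlier in this section.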

The KWIC and, to a greater extent, the KWOC indexes
suffer from the limitation of not allowing one to perform
easily an arbitrary coordinate search when large numbers of
entries are posted with the same keyword. In general, each
KWIC or KWOC entry must be linearly scanned for any
secondary concepts. If, in a KWIC index, the secondary term
immediately follows the primary keyword, then these entries
are collected in that place in the index (see Figure 3.3,
FOLATE REDUCTASE). However, all other coordinations of
terms are randomly scattered to the left or right of the
primary posting.

3.2.2. PANDEX Index

A relatively recent form of automatic indexing, known
as PANDEX and published by CCM Information Corporation
{CCM,72}, incorporates term coordination in an interesting
variation of a KWOC index. Keywords are extracted from
titles as in a KWOC index. The entire text of the title is
posted as a subordinate entry ordered alphabetically by a
secondary keyword found in the context in close proximity to
the extracted term. Both primary and secondary keywords are
printed in boldface to attract the user's eye, as
demonstrated in Figure 3.5.

Depending upon the nature of the surrounding context,
the boldface term constitutes a more specific concept by
adding a significant word from either the right or left of
the main term. Assume that w<0> is the primary keyword
selected. The title may then be stylized as

     ... w<-3> w<-2> w<-1> w<0> w<1> w<2> w<3> ...

where w<i> represents a word of the title and i its
position relative to the primary keyword. The subordinate
term is immediately chosen if w<1> is a significant keyword
(i.e., w<1> is not in the stoplist). Otherwise, the
subordinate concept is sought by examining w<-1>. If this
word too is on the stoplist, w<-2> is examined, reasoning
that w<-1> may be a function word such as "of", "in", "on",
etc., so that w<-2> is functionally related to w<0>,
producing a relevant concept coordination. If w<-2> is
non-indexable, the secondary keyword is sought alternately
from the right and left of the keyword position. The chosen
secondary keyword is then the first indexable word of the
sequence

     w<1> w<-1> w<-2> w<2> w<-3> w<3> ...

The phrase being indexed has first and last words.
Consequently, some members of the above sequence may be
nonexistent. The PANDEX construction algorithm further
restricts the range of the secondary keyword search by
bounding the words examined by certain punctuation found in
the title. A colon, semicolon, or period indicates the
introduction or termination of concepts within an index
phrase. By limiting secondary keywords to these bounded
subphrases, more useful coordinate terms are chosen.
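
The search order just described can be sketched in Python as
follows. This is a minimal illustration written from the
description above, not the PANDEX program itself: the
function names, the sample stoplist, and the rule that any
word ending in a colon, semicolon, or period closes a
subphrase (which would also split abbreviations such as
"B.F.") are all assumptions made for the example.

```python
BOUNDARY = (":", ";", ".")   # punctuation assumed to bound a subphrase


def clean(word):
    """Upper-case a title word and strip surrounding punctuation."""
    return word.strip(".,:;=").upper()


def search_order(span):
    """Offsets in the order w<1> w<-1> w<-2> w<2> w<-3> w<3> ..."""
    order = [1, -1]
    for d in range(2, span + 1):
        order += [-d, d]
    return order


def secondary_keyword(words, k, stoplist):
    """Return the secondary keyword for the primary keyword words[k],
    or None if the bounded subphrase holds no other indexable word."""
    # Locate the punctuation-bounded subphrase [lo, hi] containing k.
    lo = k
    while lo > 0 and not words[lo - 1].endswith(BOUNDARY):
        lo -= 1
    hi = k
    while hi < len(words) - 1 and not words[hi].endswith(BOUNDARY):
        hi += 1
    for offset in search_order(max(k - lo, hi - k)):
        i = k + offset
        if lo <= i <= hi and clean(words[i]) not in stoplist:
            return clean(words[i])
    return None


stoplist = {"A", "AND", "FOR", "IN", "OF", "ON", "THE", "TO"}
title = ("Bibliography of Research Relating to the "
         "Communication of Information.").split()
print(secondary_keyword(title, 8, stoplist))  # COMMUNICATION
```

In the usage line, w<1> does not exist and w<-1> ("of") is on
the stoplist, so w<-2> is taken, giving the coordination
COMMUNICATION / INFORMATION.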

Although the keywords are printed in boldface, the user
must still locate them within the title, which may cause as
much duress as scanning large blocks of KWIC entries. He
may well have to scan the entire block containing a keyword
of interest anyway, since only one extra keyword is
highlighted. The user may find clues from other words of
the phrase.

THYROID

     Effect of propyl thio uracil in the survival of rat
          THYROID CELLS in vivo and in vitro. = 577

     Thyro Globulin Immunity. Effect of THYROID IMMUNE and
          other protein-thyroxine complexes on tissue
          concentration of labeled thyroxine and tadpole
          metamorphosis. = 71

THYROXINE

     Thyro Globulin Immunity. Effect of Thyroid Immune and
          other protein THYROXINE COMPLEXES on tissue
          concentration of labeled thyroxine and tadpole
          metamorphosis. = 71

     THYROXINE DEGRADATION. Anti-oxidant function and
          non-enzymic degradation during microsomal lipid
          per oxidation. = '-91

Figure 3.5 A portion of a PANDEX index

3.2.3. Articulated Subject Index

The organization of both the KWIC and KWOC indexes leads
a user to perform much unnecessary scanning of irrelevant
context surrounding keywords. PANDEX, to some extent,
overcomes this problem, though still not adhering to the
organizational structure of subject indexes or
back-of-the-book indexes.

The automatic generation of subject indexes from
title-like phrases has been studied by Armitage and Lynch from
examinations of the subject index to Chemical Abstracts
{Armitage,67}. The articulated subject index consists of a
set of subject headings, in alphabetical order, under which
are indented a series of modifying phrases or modifiers (see
Figure 3.6). The modifiers are listed in alphabetical order
by their significant words. Common words such as
prepositions, conjunctions, and articles are ignored when
ordering the modifiers.
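
The ordering rule for modifiers can be illustrated with a
short Python sketch. The list of common words and the
function names below are assumptions made for the example;
an actual index would use whatever function-word list its
compilers adopt.

```python
# Assumed list of common words ignored when ordering modifiers.
COMMON_WORDS = {"A", "AND", "BY", "FOR", "FROM", "IN",
                "OF", "ON", "THE", "TO", "WITH"}


def modifier_sort_key(modifier):
    """Sort key: the significant words of the modifier, in order."""
    words = modifier.upper().replace(",", " ").split()
    return [w for w in words if w not in COMMON_WORDS]


def order_modifiers(modifiers):
    """Alphabetize modifiers by their significant words only."""
    return sorted(modifiers, key=modifier_sort_key)


print(order_modifiers([
    "base exchange of",
    "from barium-133 decay",
    "atomic scattering factor of",
]))
# ['atomic scattering factor of', 'from barium-133 decay', 'base exchange of']
```

The modifier beginning with "from" files under "barium"
because the preposition is skipped, which is the behavior
visible under the heading "Cesium" in Figure 3.6.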



Cesium
     absorption by plants, fertilizer effect on, 60:13813f
          by plants, soil colloids and, 60:11321h
          by roots, Ca and, 60:12620b
     adenosine triphosphatase response to, 60:4400b
     adsorption of, by Hg electrodes, in presence of
               methylformamide, 60:8668c
          from radioactive waste water by clay, 60:3865g
          from Na soln. by clinoptilolite, heat-treatment
               effect on, 60:15482h
     argoid gel properties in presence of, 60:6246e
     atomic scattering factor of, 60:7528d
     from barium-133 decay, angular correlation, 60:1283h
     base exchange of, in alcs. and aq. alcs., 60:2359b
          with ammonia on faujasite-type zeolites, 60:7490h
          on Bio-Rex 70 and Dowex-50W, hydration in relation
               to, 60:42e
          with Ca and Li solvents in relation to, 60:7493c
          with K and Na in zeolites, 60:13024g
          with Na in two-temp. process, 60:9951d

Figure 3.6 A portion of an Articulated Subject Index



A subject heading, together with its modifiers, can be
arranged to form a meaningful phrase. The method of
synthesizing this descriptive phrase from an index entry
provides a basis for automatic construction of the index
entries. Some of the very words found on stoplists for KWIC
index construction - prepositions and conjunctions -
separate the full phrase into substantive phrases which can
act as subject headings in an index. A full phrase can be
represented as a string of n substantive phrases separated
by n-1 function words (articulation points):

     -o-o-o-o-o-

where

     -  indicates a substantive phrase
     o  indicates an articulation point
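
A small Python sketch of this model is given below. It
assumes a caller-supplied set of function words and a phrase
already in the simple form described above (no two function
words in succession); both the function name and the word
set are hypothetical.

```python
def articulate(phrase, function_words):
    """Split a phrase into substantive phrases and articulation points.

    Returns (substantive_phrases, articulation_points); for a phrase
    in the simple model there is one more substantive phrase than
    there are articulation points.
    """
    substantive, points, current = [], [], []
    for word in phrase.upper().split():
        if word in function_words:
            substantive.append(" ".join(current))
            points.append(word)
            current = []
        else:
            current.append(word)
    substantive.append(" ".join(current))
    return substantive, points


subs, points = articulate("Articulation in Indexes for Books on Science",
                          {"IN", "FOR", "ON"})
print(subs)    # ['ARTICULATION', 'INDEXES', 'BOOKS', 'SCIENCE']
print(points)  # ['IN', 'FOR', 'ON']
```

Each substantive phrase returned is a candidate subject
heading; the function words mark where the phrase may be
cut when modifiers are formed.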

The modifiers may be further broken into components and
separated by commas. When two or more modifiers share an
initial component, the component is printed once and the
remaining modifiers are indented beneath. In this manner, a
high degree of organization is introduced into the index
display, permitting useful coordination of components with
the subject heading.

This model of an articulatable phrase serves as the
simplest example for the logical generation of subject
headings. The general rule for constructing index entries
states that if one of the substantive phrases is chosen as a
subject heading, then all possible modifiers are formed by
choosing an adjacent function word and the subphrase adjacent
to it, and continuing this selection as long as the first
subphrase has not been chosen. At each stage, sets of
contiguous function-word-phrases may be chosen. For
example, if "books" is chosen as a subject heading, the
selection of modifiers is given in the example below:



        ARTICULATION IN INDEXES FOR BOOKS ON SCIENCE

                          BOOKS
                            |
               +------------+-------------+
               |                          |
          INDEXES FOR                ON SCIENCE
               |                          |
        +------+-------+                  |
        |              |                  |
 ARTICULATION IN   ON SCIENCE        INDEXES FOR
        |              |                  |
    ON SCIENCE   ARTICULATION IN   ARTICULATION IN



In standard form,

     BOOKS
          INDEXES FOR, ARTICULATION IN, ON SCIENCE
          INDEXES FOR, ON SCIENCE, ARTICULATION IN
          ON SCIENCE, INDEXES FOR, ARTICULATION IN

The multiple set "ARTICULATION IN INDEXES FOR" could have
been chosen, yielding the added terms

          ARTICULATION IN INDEXES FOR, ON SCIENCE
          ON SCIENCE, ARTICULATION IN INDEXES FOR



All possible index entries for this phrase are illustrated 
in Figure 3.7. 

To reconstruct the full descriptive phrase from an
index entry, simply concatenate the components, in the order
specified by the modifier, to the left of the subject
heading if the component ends with a function word, or to
the right if the component begins with a function word.

ARTICULATION
     IN INDEXES FOR BOOKS ON SCIENCE

BOOKS
     ARTICULATION IN INDEXES FOR, ON SCIENCE
     INDEXES FOR, ARTICULATION IN, ON SCIENCE
          -- , ON SCIENCE, ARTICULATION IN
     ON SCIENCE, ARTICULATION IN INDEXES FOR
          -- , INDEXES FOR, ARTICULATION IN

INDEXES
     ARTICULATION IN, FOR BOOKS ON SCIENCE
     FOR BOOKS ON SCIENCE, ARTICULATION IN
     FOR BOOKS, ARTICULATION IN, ON SCIENCE

SCIENCE
     ARTICULATION IN INDEXES FOR BOOKS ON
     BOOKS ON, ARTICULATION IN INDEXES FOR
          -- , INDEXES FOR, ARTICULATION IN
     INDEXES FOR, ARTICULATION IN BOOKS ON

Figure 3.7 All articulated index phrases generated
from the title "Articulation in Indexes for Books
on Science"
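
The reconstruction rule can be expressed as a short Python
sketch. The function-word set and the function name are
assumptions made for illustration, and the sketch expects the
modifier to be written out in full (without the ditto marks
used in the printed display).

```python
# Assumed set of function words acting as articulation points.
FUNCTION_WORDS = {"AND", "BY", "FOR", "IN", "OF", "ON", "TO", "WITH"}


def reconstruct(heading, modifier):
    """Rebuild the full phrase from a subject heading and its modifier.

    Components are taken in the order given by the modifier. A
    component ending with a function word is attached to the left of
    the phrase built so far; one beginning with a function word is
    attached to the right.
    """
    phrase = heading.upper().split()
    for component in modifier.upper().split(","):
        words = component.split()
        if not words:
            continue
        if words[-1] in FUNCTION_WORDS:
            phrase = words + phrase
        elif words[0] in FUNCTION_WORDS:
            phrase = phrase + words
        else:
            raise ValueError("component has no articulation point: "
                             + component.strip())
    return " ".join(phrase)


print(reconstruct("BOOKS", "INDEXES FOR, ARTICULATION IN, ON SCIENCE"))
# ARTICULATION IN INDEXES FOR BOOKS ON SCIENCE
```

The same call with the heading "INDEXES" and the modifier
"FOR BOOKS ON SCIENCE, ARTICULATION IN" yields the identical
phrase, since the first component is attached on the right
and the second on the left.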

Articulated subject indexes are perhaps the most useful
that could be constructed from single title-like phrases by
strictly derivative techniques. The depth of an articulated
subject index equals that of any other indexing method
previously discussed. Its power lies in the organization
and depth of the entries. Coordination of subphrases can be
performed to the limit of discriminating among any similar
phrases, regardless of the position of the subject heading
or components within the full phrase. The size of the
index, though somewhat large when compared to KWIC (see
Appendix A), could possibly be tolerated when its usefulness
is considered. Armitage and Lynch {Armitage,67} have
presented several rules for trimming the number of entries
generated per phrase, claiming to retain all useful
coordinations.

The major drawback to the articulated subject index
approach is the English language itself. Not all title
phrases follow the simple model of an articulated phrase.
In their study of Chemical Abstracts, Armitage and Lynch
found that only 66% of the phrases examined conformed to
this "normal form" {Armitage,67}. The most common causes
of irregularities were:

a) use of adjectival modifiers instead of articulated
phrases;

b) use of infinitives and other verb constructions.

To perform 100% of the time, as KWIC indexing methods do,
automated articulated subject index construction must either
resort to automated syntactic analysis of natural language
text or to a manual editing of titles presented for input.

The first alternative is desirable since many commercial
institutions are providing document titles in
computer-readable form. An in-depth syntactic analysis of titles
would permit the entire indexing process to continue
automatically. On the other hand, to be competitive
costwise with KWIC techniques, the computing time should be
minimized - a highly improbable task when analyzing natural
text.

Manual editing of titles is equally undesirable.
Trained indexers would undoubtedly be necessary to perform
such tasks, introducing error, inconsistency, and cost into
the indexing procedure.

Young and Rush {Young,72} are examining the problems of
automatically "normalizing" phrases through linguistic
analysis so that articulated subject index algorithms can be
directly applied.

3.3. Approach Explored in this Thesis

The approach to improved index construction explored in
this thesis combines many of the aspects of
computer-generated and computer-compiled techniques. The discussion
and illustrations of section 3.2.1 have demonstrated the
capabilities of the KWIC indexing technique to provide
immediate access to all significant words of a title;
however, secondary concepts must be found by searching for
contextual relationships in the text about the keyword. The
PERMUTERM index, discussed in section 3.1.4, provides
immediate access to secondary concepts; however, since no
syntax information is supplied concerning the relationship
between the subordinate and main keywords, false retrievals
may occur when the concepts described by these single
keywords are not related in the manner expected by a
user.

The chapters to follow discuss a refinement of the KWIC
indexing technique which combines the immediate secondary
access capabilities of the PERMUTERM indexing technique and
the contextual relationships and automated construction ease
of the KWIC indexing technique to produce indexes which
approach the usefulness of articulated subject indexes.




CHAPTER IV. THE PROTOTYPE DOUBLE-KWIC (DKWIC) COORDINATE
INDEX



The need for high-quality printed indexes to facilitate
manual retrieval of information has not diminished, despite
the strides that have been made in the development of
automatic information retrieval systems. Nevertheless,
attempts to produce high-quality indexes by automated
techniques have only recently begun to merit serious
attention (see Chapter 3). Perhaps the most significant
breakthrough in this area occurred when Luhn and others
successfully applied the keyword-in-context (KWIC) indexing
concept as an automated indexing technique (see section
3.2.1). The widespread use of KWIC indexes since that time
and the variety of formats in which they have appeared have
been reviewed by Fischer {Fischer,66} and others
{Adams,68, Stevens,65}.

The rapid rise in popularity of KWIC indexes apparently
has been due to the high speed and low cost of producing
them. However, as noted by Fischer, there has been some
dissatisfaction with the quality of KWIC indexes. Most of
the attempts to improve quality have dealt with variations
in format to improve readability or with enrichment of
titles to provide additional index entries which otherwise
would not have been derived from words in the titles.

The enrichment of titles improves the quality of KWIC
indexes by increasing the breadth of indexing. An equally
attractive possibility, which appears to have been little
explored, involves extension of the KWIC indexing principle
to provide for an increased depth of indexing. If a greater
depth of indexing were possible, it would help to overcome
one of the major drawbacks of KWIC indexing, namely,
searching for a specific concept when a large number of
index entries are posted under a given keyword.

One of the difficulties encountered in such a situation
is illustrated by the set of KWIC index entries shown in
Figure 4.1, which are taken from a KWIC index of titles from
Volume 7 of the Journal of Chemical Documentation. Because
these index entries are subordered on the basis of words
immediately following the word in the index column, the
resulting order differs markedly from the usual order one
would find in a back-of-the-book index or an articulated
subject index. For example, several of the entries indexed
under "INFORMATION" indicate that the titles deal with
"TECHNICAL INFORMATION," but the entries are scattered
because of the ordering principle just described. A similar
situation applies to entries describing "INFORMATION
RETRIEVAL" and "INFORMATION STORAGE", brought about by slight
differences in title phraseology.

     INFORMATION. =+STORAGE AND VERIFICA     43
     INFORMATION. =+ARCH RELATING TO THE   B257
     INFORMATION. =                   KE    232
     INFORMATION  ANNOUNCEMENT SYSTEMS F    142
     INFORMATION  CENTER ADMINISTRATION,   B257
     INFORMATION  DISTRIBUTION USING CO+    124
     INFORMATION  GROUPS - INTRODUCTORY+    110
     INFORMATION  MANAGEMENT IN ENGINEER   B2-2
     INFORMATION  PROGRAM.=+ORS IN BUILD    107
     INFORMATION  RETRIEVAL SYSTEM AND +    124
     INFORMATION  RETRIEVAL: A COMPUTER-     98
     INFORMATION  SCIENCE AND TECHNOLOGY   B3-2
     INFORMATION  SCIENTISTS.=+DEMIC TRA    118
     INFORMATION  SERVICES.=+INUING EDUC    115
     INFORMATION  SERVICES.=+ AND INTEGR    111
     INFORMATION  STORAGE AND RETRIEVAL+     43
     INFORMATION  SYSTEM.=           EDI   E 61
     INFORMATION  SYSTEMS.=+TION IN A LA    192
     INFORMATION  SYSTEMS.=           DE    101
     INFORMATION  SYSTEMS IN CURRENT US+   B3-2

Figure 4.1 A portion of a conventional KWIC index
illustrating the randomization of secondary
concepts found for a high-density keyword. Note
the randomization of concepts "TECHNICAL
INFORMATION", "INFORMATION STORAGE", and
"INFORMATION RETRIEVAL".

In another format for the KWIC index (Figure 4.2), a
variant of the KWOC format discussed in section 3.2.1, the
situation is even worse. In this format, the index word is
extracted from the title and replaced by an asterisk to
indicate its location in the title. All of the titles, or
portions thereof, from which a given index term is extracted
are then grouped together under that index term and are
subordered on the basis of the accession numbers for the
titles from which they are derived. This method of ordering
is worse than the first because of complete randomization
of the words to the right as well as to the left of the
index words. Also, this second format makes it more
difficult to determine the immediate context about the
keyword when scanning the individual entries, since the
keyword - in this case, its identifying asterisk - no longer
appears in a fixed position.

INFORMATION

 STEM. 1. STORAGE AND VERIFICATION OF STRUCTURAL *.=       43
 A CHEMICALLY ORIENTED * STORAGE AND RETRIEVAL SYSTE       43
 BIOMEDICAL * RETRIEVAL: A COMPUTER-BASED SYSTEM FOR       98
 DETERMINING COSTS OF * SYSTEMS.=                         101
 FACTORS IN BUILDING AN OPERATIONAL * PROGRAM.=           107
 SYMPOSIUM ON ADMINISTRATION OF TECHNICAL * SERVICES      110
 COORDINATION AND INTEGRATION OF TECHNICAL * SERVICE      111
 CONTINUING EDUCATION IN TECHNICAL * SERVICES.=           115
 SALARIES AND ACADEMIC TRAINING PROGRAMS FOR * SCIEN      118
 THE B.F. GOODRICH * RETRIEVAL SYSTEM AND AUTOMATIC       124
 AUTOMATIC * DISTRIBUTION USING COMPUTER-COMPILED TH      124
 SELECTIVE * ANNOUNCEMENT SYSTEMS FOR A LARGE COMMUN      142
 NIQUE NOTATION IN A LARGE-SCALE CHEMICAL * SYSTEM.=      192
 KEYBOARDING CHEMICAL *.=                                 232
 BOOK_REVIEW: * MANAGEMENT IN ENGINEERING EDUCATION.     B2-2
 OOK_REVIEW: ANNUAL REVIEW OF * SCIENCE AND TECHNOL      B3-2
 IENTIFIC AND TECHNICAL * SYSTEMS IN CURRENT USE.=       B3-2
 HY OF RESEARCH RELATING TO THE COMMUNICATION OF *.=     B257
 BOOK_REVIEW: TECHNICAL * CENTER ADMINISTRATION, VOL     B257
 EDITORIAL: A NATIONAL * SYSTEM.=                        E 61

Figure 4.2 A variant form of a KWIC (also called
KWOC) index illustrating complete randomization of
secondary concepts for the same titles illustrated
in Figure 4.1

In another format for a KWOC index of these same titles
(Figure 4.3), the keyword is extracted and the full text of
the altered title is posted beneath this term. The
subordering of altered titles is arbitrary or, as shown in
Figure 4.3, based on the words following the extracted term.
Although all concepts of the original title are retained,
the randomization of words to the left of the index term as
well as non-contiguously to the right forces the user of a
KWOC index to scan all the text of each entry to identify
all articles describing a secondary concept.



INFORMATION

 A CHEMICALLY ORIENTED INFORMATION STORAGE AND
      RETRIEVAL SYSTEM. 1. STORAGE AND VERIFICATION OF
      STRUCTURAL *.=                                       43
 BOOK_REVIEW: BIBLIOGRAPHY OF RESEARCH RELATING TO
      THE COMMUNICATION OF *.=                           B257
 KEYBOARDING CHEMICAL *.=                                 232
 SELECTIVE * ANNOUNCEMENT FOR A LARGE COMMUNITY OF
      USERS.=                                             142
 BOOK_REVIEW: TECHNICAL * CENTER ADMINISTRATION,
      VOL 3.=                                            B257
 SYMPOSIUM ON ADMINISTRATION OF TECHNICAL * GROUPS
      - INTRODUCTORY REMARKS.=                            110
 BOOK_REVIEW: * MANAGEMENT IN ENGINEERING EDUCATION
      .=                                                 B2-2
 FACTORS IN BUILDING AN OPERATIONAL * PROGRAM.=           107
 B.F. GOODRICH * RETRIEVAL SYSTEM AND AUTOMATIC
      INFORMATION DISTRIBUTION USING COMPUTER-COMPILED
      THESAURUS AND DUAL DICTIONARY.=                     124
 BIOMEDICAL * RETRIEVAL: A COMPUTER-BASED SYSTEM FOR
      INDIVIDUAL USE.=                                     98
 BOOK_REVIEW: ANNUAL REVIEW OF * SCIENCE AND
      TECHNOLOGY.=                                       B3-2
 SALARIES AND ACADEMIC TRAINING PROGRAMS FOR
      * SCIENTISTS.=                                      118
 CONTINUING EDUCATION IN TECHNICAL * SERVICES.=           115
 COORDINATION AND INTEGRATION OF TECHNICAL *
      SERVICES.=                                          111
 A CHEMICALLY ORIENTED * STORAGE AND RETRIEVAL
      SYSTEM. 1. STORAGE AND VERIFICATION OF
      STRUCTURAL INFORMATION.=                             43
 EDITORIAL: A NATIONAL * SYSTEM.=                        E 61
 USE OF NONUNIQUE NOTATION IN A LARGE-SCALE CHEMICAL
      * SYSTEM.=                                          192
 DETERMINING COSTS OF * SYSTEMS.=                         101
 BOOK_REVIEW: NONCONVENTIONAL SCIENTIFIC AND
      TECHNICAL * SYSTEMS IN CURRENT USE.=               B3-2

Figure 4.3 Another KWOC format illustrating
complete randomization of secondary concepts for
the high-density concepts of Figure 4.1

The PANDEX format (see section 3.2.2) for these same
titles (Figure 4.4) leaves something to be desired also.
The PANDEX index construction generally performs a
coordination of a single secondary concept with the main
index term from a given title. The title, however, may
contain other secondary concepts not highlighted in the
index phrase. In many instances, the secondary concept
chosen does not represent the most appropriate subordinate
term. The selection of subordinate concepts can induce
further scattering of terms. Four occurrences of the phrase
"TECHNICAL INFORMATION" appear in the titles indexed in
Figure 4.4, yet only two entries specify "TECHNICAL" as the
highlighted concept. To locate all occurrences of a more
specific concept, a user will be forced to linearly scan the
text of all titles posted beneath the main heading, much as
in a KWIC or KWOC index.

To overcome some of the difficulties of these indexing
approaches, studies have been initiated by Armitage and
Lynch {Armitage,67}, Dolby {Dolby,68}, and others to analyze
the characteristics of traditional subject indexes. Their
approaches tend to require linguistic analysis of titles and
title-like phrases to effect the transformations required to
produce such higher-quality indexes by automated techniques
(see section 3.2.3). This chapter presents a simpler
approach to automatic preparation of higher-quality
indexes, based on an extension of the KWIC indexing
concept. For reasons which will soon become apparent, we
have chosen the name "Double-KWIC Coordinate Index" for the
printed index produced by this new approach.

INFORMATION

 Keyboarding CHEMICAL INFORMATION.=                       232
 Book_review: Bibliography of Research Relating to
      the COMMUNICATION of INFORMATION.=                 B257
 B.F. Goodrich Information Retrieval System and
      automatic INFORMATION DISTRIBUTION using Computer-
      Compiled Thesaurus and Dual Dictionary.=            124
 Book_review: INFORMATION MANAGEMENT in Engineering
      Education.=                                        B2-2
 Factors in Building an Operational INFORMATION
      PROGRAM.=                                           107
 Biomedical INFORMATION RETRIEVAL: A Computer-based
      System for Individual Use.=                          98
 B.F. Goodrich INFORMATION RETRIEVAL System and
      Automatic Information Distribution using Computer-
      Compiled Thesaurus and Dual Dictionary.=            124
 Book_review: Annual Review of INFORMATION SCIENCE
      and Technology.=                                   B3-2
 Salaries and Academic Training Programs for
      INFORMATION SCIENTISTS.=                            118
 SELECTIVE INFORMATION Announcement for a large
      Community of Users.=                                142
 Coordination and Integration of Technical
      INFORMATION SERVICES.=                              111
 Continuing Education in Technical INFORMATION
      SERVICES.=                                          115
 A Chemically Oriented INFORMATION STORAGE and
      Retrieval System. 1. Storage and Verification of
      Structural Information.=                             43
 Determining Costs of INFORMATION SYSTEMS.=               101
 Use of Nonunique Notation in a large-scale Chemical
      INFORMATION SYSTEM.=                                192
 Editorial: A National INFORMATION SYSTEM.=              E 61
 Book_review: Nonconventional Scientific and
      Technical INFORMATION SYSTEMS in Current Use.=     B3-2
 Symposium on Administration of TECHNICAL INFORMATION
      Groups - Introductory Remarks.=                     110
 Book_review: TECHNICAL INFORMATION Center
      Administration, Vol 3.=                            B257

Figure 4.4 A PANDEX index for the same titles of
Figure 4.1 illustrating partial ordering of a
single secondary concept for each title where the
secondary concept chosen is not always the most
appropriate one

4.1. Construction of the Double-KWIC Coordinate Index

As illustrated in Figure 4.5, the double-KWIC
coordinate index is constructed as follows:

1) The first significant word in a title is extracted
as a main index term and replaced by an asterisk (*) to
indicate its position in the title.

2) The remaining words in the title are then rotated,
so as to permit each significant word to appear as the
first word of a wrap-around subordinate entry under the
main index term.

Steps 1 and 2 are repeated until all of the titles of a
given bibliographic listing are processed. The index
entries so created are then sorted alphabetically, both with
regard to main terms (primary sort) and subordinate terms
(secondary sort). Word significance for selection of main
index terms and subordinate index terms is established on
the basis of stoplists, discussed later. Also, main index
terms are not restricted to single words, but may consist of
multi-word terms derived from contiguous sets of words in
the titles.
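
The construction steps can be sketched in Python as shown
below. This is a simplified illustration, not the program
used to prepare the prototype index. It assumes that every
word not on the main-term stoplist serves in turn as a main
term (as Figure 4.5 suggests), that separate stoplists govern
main and subordinate terms, and that only single-word main
terms are produced. The function names and the sample
stoplists are hypothetical.

```python
PUNCT = ".,:;=-"   # punctuation ignored when testing word significance


def bare(word):
    """Return the word with surrounding punctuation removed."""
    return word.strip(PUNCT)


def dkwic_entries(title, accession, main_stop, sub_stop):
    """Return sorted (main term, rotated subordinate entry, accession)
    triples for one title."""
    words = title.upper().split()
    entries = []
    for i, word in enumerate(words):
        main = bare(word)
        if not main or main in main_stop:
            continue
        # Step 1: replace the main term by an asterisk, keeping any
        # trailing punctuation such as the end-of-title mark ".=".
        tail = word[len(word.rstrip(PUNCT)):]
        starred = words[:i] + ["*" + tail] + words[i + 1:]
        # Step 2: rotate so each significant remaining word leads a
        # wrap-around subordinate entry.
        for j, sub in enumerate(starred):
            if j == i or not bare(sub) or bare(sub) in sub_stop:
                continue
            rotated = starred[j:] + starred[:j]
            entries.append((main, " ".join(rotated), accession))
    # Primary sort on main term, secondary sort on subordinate entry.
    return sorted(entries)


entries = dkwic_entries(
    "THE NOMENCLATURE OF HIGHLY FLUORIDATED MOLECULES.=", "25",
    main_stop={"THE", "OF", "HIGHLY", "FLUORIDATED"},
    sub_stop={"THE", "OF"})
for main, sub, acc in entries:
    print(main, "|", sub, "|", acc)
```

For the title of Figure 4.5 the sketch yields three
subordinate entries under NOMENCLATURE and three under
MOLECULES. Field truncation and the treatment of wrapped
leading words in the printed index are not modeled.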

TITLE
  |
  THE NOMENCLATURE OF HIGHLY FLUORIDATED MOLECULES.=        25

MAIN TERM (main term extracted)
  |
  +--> NOMENCLATURE
            FLUORIDATED MOLECULES.= THE * OF HIGHLY         25
            HIGHLY FLUORIDATED MOLECULES.= THE * OF         25
       +--> MOLECULES.= THE * OF HIGHLY FLUORIDATED         25
       |
SUBORDINATE TERM

       MOLECULES
            NOMENCLATURE OF HIGHLY FLUORIDATED *            25
            NOMENCLATURE OF HIGHLY FLUORIDATED *.=          25
            FLUORIDATED *.= NOMENCLATURE OF HIGHLY          25

       FLUORIDATED MOLECULES
            NOMENCLATURE OF HIGHLY *.=                      25
            HIGHLY *.= NOMENCLATURE OF                      25

Figure 4.5 Construction of the prototype double-
KWIC (DKWIC) coordinate index entries

To illustrate some of the advantages of the double-KWIC
coordinate indexing technique and to provide some comparison
with indexing schemes described and illustrated in the
introduction to this chapter, a prototype DKWIC index was
prepared {Petrarca,69a} from the same titles used for
creating those sample illustrations (i.e., those titles
appearing in Volume 7 of the Journal of Chemical
Documentation). The prototype index was derived from 71
titles and contained approximately 1500 primary and
secondary access points. A KWIC index prepared from these
same titles contained only 350 primary access entries.

Figure 4.6 illustrates an annotated portion of the
display format used for the prototype index produced by the
double-KWIC coordinate indexing scheme discussed above. The
complete prototype index has been published elsewhere
{NAPS,69}.



BOOK_REVIEW
 ADMINISTRATION, VOL 3.= +CHNICAL INFORMATION CENTER     B257
 ANALYSIS.= +NG NUMERICAL DATA PROJECTS A SURVEY AND     B2-2
 ANALYSIS, VOL 4.= +YCLOPEDIA OF INDUSTRIAL CHEMICAL     B258
 ANNUAL REVIEW OF INFORMATION SCIENCE AND TECHNOLOGY     B3-2
 APPLICATIONS.= *: COMPUTER                              B258
 BASIC PRINCIPLES OF CHEMISTRY.=                         B3-2
 BIBLIOGRAPHIC REVIEW.= *: SALICYLATES. A CRITICAL       B
 BIBLIOGRAPHIC, AND CATALOG ENTRIES.= +ING OF INDEX,     B2-2
 BIBLIOGRAPHY OF RESEARCH RELATING TO THE COMMUNICA+     B257
 BIOCHEMICAL PREPARATIONS.= *:                           B3-2
 BOOK.= *: CHEMICAL DATA                                 B2-2
 BOOK OF CHEMISTRY.= *: REFERENCE                        B2-2
 CAS TODAY.= *:                                          B182
 CATALOG ENTRIES.= +ING OF INDEX, BIBLIOGRAPHIC, AND     B2-2
 CENTER ADMINISTRATION, VOL 3.= +CHNICAL INFORMATION     B257
 CHEMICAL ANALYSIS, VOL 4.= +CYCLOPEDIA OF INDUSTRIAL    B258
 CHEMICAL DATA BOOK.=                                    B2-2

1 - main index term

2 - location of main index term in title being permuted
    (rotated) for creation of subordinate entries

3 - subordinate index term

4 - word in wrap-around title which immediately precedes
    subordinate index term

5 - truncation symbol used when words in wrap-around title
    do not fit in allotted field

6 - symbol indicating the end of title

7 - accession code for title represented by subordinate
    phrase. Alphabetic characters preceding the page
    number represent the following: B - book review; E -
    editorial. Also, the two page-numbering systems used
    by the Journal are represented by the following
    formats: (1) Unhyphenated - arabic numbered pages used
    for sequential numbering of the pages for Volume 7; (2)
    Hyphenated - Roman numeral pages for the individual
    issues of Volume 7. The number preceding the hyphen is
    the issue number.

Figure 4.6 Annotated description of the display
format for the prototype double-KWIC coordinate
index derived from titles in Journal of Chemical
Documentation, Volume 7

4.2. Utility of the Double-KWIC (DKWIC) Coordinate Index

To illustrate some of the advantages of the double-KWIC
coordinate indexing technique, Figure 4.7 and the figures
that follow it display portions of the prototype DKWIC index
for comparison with portions of the indexes shown in Figures
4.1 through 4.4, which were derived from the same titles.
Figure 4.7 illustrates the portion of the DKWIC index for
the main term "INFORMATION". The DKWIC index eliminates the
random ordering of subordinate concepts found in the KWIC
index and its variants (Figures 4.1 - 4.3). The alphabetic
ordering of subordinate concepts of the DKWIC construction
enables one to quickly scan the subordinate index terms to
find the particular subordinate concept. Since all
significant words remaining in the titles are chosen as
subordinate terms, all secondary terms chosen for the PANDEX
index are included in the DKWIC index. Note that in the
DKWIC index all titles pertaining to "TECHNICAL INFORMATION"
are located in one place (see Figure 4.7).

Both the KWIC and KWOC indexes would permit one to
locate equally as well those precoordinate index terms under
the heading for the modifier immediately preceding the word
"INFORMATION." The PANDEX index aids in this coordination
by highlighting some of these important words, as noted in
Figure 4.4. However, as illustrated in Figure 4.8, the
DKWIC index permits immediate access to these precoordinate
entries through the creation of multi-word main terms.



INFORMATION
 ACADEMIC TRAINING PROGRAMS FOR * SCIENTISTS.= + AND      118
 ADMINISTRATION, VOL 3.= +VIEW: TECHNICAL * CENTER       B257
 ADMINISTRATION OF TECHNICAL * GROUPS - INTRODUCTORY      110
 ANNOUNCEMENT FOR A LARGE COMMUNITY OF USERS.= +VE *      142
 ANNUAL REVIEW OF * SCIENCE AND TECHNOLOGY.= +EVIEW:     B3-2
 AUTOMATIC * DISTRIBUTION USING COMPUTER-COMPILED TH      124
 BIBLIOGRAPHY OF RESEARCH RELATING TO THE COMMUNICA+     B257
 BIOMEDICAL * RETRIEVAL: A COMPUTER-BASED SYSTEM FO+       98
 BOOK_REVIEW: ANNUAL REVIEW OF * SCIENCE AND TECHNO+     B3-2
 BOOK_REVIEW: * MANAGEMENT IN ENGINEERING EDUCATION+     B2-2
 BOOK_REVIEW: NONCONVENTIONAL SCIENTIFIC AND TECHNI+     B3-2
 BOOK_REVIEW: TECHNICAL * CENTER ADMINISTRATION, VO+     B257
 BUILDING AN OPERATIONAL * PROGRAM.= FACTORS IN           107
 CHEMICAL *.= KEYBOARDING                                 232
 CHEMICAL * SYSTEM.= +IQUE NOTATION IN A LARGE-SCALE      192
 CHEMICALLY ORIENTED * STORAGE AND RETRIEVAL SYSTEM+       43
 COMMUNICATION OF *.= +Y OF RESEARCH RELATING TO THE     B257
 COMMUNITY OF USERS.= +TRAINING PROGRAMS FOR A LARGE      142
 COMPILED THESAURUS AND DUAL DICTIONARY.= + COMPUTER      124
 COMPUTER-BASED SYSTEM FOR INDIVIDUAL USE.= +EVAL: A       98
 COMPUTER-COMPILED THESAURUS AND DUAL DICTIONARY.= +      124
 CONTINUING EDUCATION IN TECHNICAL * SERVICES.=           115
 COORDINATION AND INTEGRATION OF TECHNICAL * SERVIC+      111
 COSTS OF * SYSTEMS.= DETERMINING                         101
 DICTIONARY.= + COMPUTER COMPILED THESAURUS AND DUAL      124
 DISTRIBUTION USING COMPUTER COMPILED THESAURUS AND+      124
 DUAL DICTIONARY.= + COMPUTER COMPILED THESAURUS AND      124
 EDITORIAL: A NATIONAL * SYSTEM.=                        E 61
 EDUCATION.= +OK_REVIEW: * MANAGEMENT IN ENGINEERING     B2-2
 EDUCATION IN TECHNICAL * SERVICES.= CONTINUING           115
 INDIVIDUAL USE.= +EVAL: A COMPUTER-BASED SYSTEM FOR       98
 INTEGRATION OF TECHNICAL * SERVICES.= +DINATION AND      111
 INTRODUCTORY REMARKS.= +ION OF TECHNICAL * GROUPS -      110
 KEYBOARDING CHEMICAL *.=                                 232
 NONCONVENTIONAL SCIENTIFIC AND TECHNICAL * SYSTEMS+     B3-2
 NONUNIQUE NOTATION IN LARGE-SCALE CHEMICAL * SYSTE+      192
 NOTATION IN LARGE-SCALE CHEMICAL * SYSTEMS.= +NIQUE      192
 ORIENTED * STORAGE AND RETRIEVAL SYSTEM. 1. STORAG+       43
 OPERATIONAL * PROGRAM.= FACTORS IN BUILDING AN           107
 RESEARCH RELATING TO THE COMMUNICATION OF *.= + OF      B257
 RETRIEVAL SYSTEM. 1. STORAGE AND VERIFICATION OF S+       43
 RETRIEVAL SYSTEM AND AUTOMATIC * DISTRIBUTION USIN+      124
 SCIENCE AND TECHNOLOGY.= +EVIEW: ANNUAL REVIEW OF *     B3-2
 SCIENTIFIC AND TECHNICAL * SYSTEMS IN CURRENT USE.=     B3-2

SCIENTISTS.- * AND ACADSMTC TRAINING PROGRAMS FOP * 118 

SALARIES AND ACADEMIC TRAINING PFO^RAMS FOP * SCIR* 11.8 

SERVICES. = ♦DINATICN AND INTEGRATION OF T=;CHNICAL * 111 

SERVICES. = CONTTNrJIN''; EDUCATION IN TECHNICAL * 115. 



SELECTIVE ♦ AJ.MOUNCEKENT FOR- A LARGS COMMUNITY OF * ^ 142 

STOFAGE AMD BETPIEVAL SYSTEM. 1. STORAGE AND VERTF+ ' US 

STORAGE ^ASD VERIFlCATTCy OF STRUCTURAL * •= ♦EM. 1. 43 

SY?lPOSIUJ|gON ADMINISTRATION OF TECHNICAL * GROUPS ♦ 43 

SYSTE!1.=WlQUE NOTATICN IN A LARGE-SCALE CHEMICAL * 192 

SYSTFM* = .. » EDITORIAL: A NATIONAL * E f1- 

SYSTEM. 1. STORAGE AND VERIFICATION OF STRUCTURAL *■ 43 

SYSTEM AND AUTOMATIC * DISTRIBUTION USING COMPUTER* 124 

SYSTEMS. = .... DETERMINING COSTS OF * 101 

SYSTEMS IN CURRENT USE.= SCIENTIFIC 'AND TECHNICAL * 63-2 

TECHNICAL ♦ GROUPS - INTRODUCTORY REMARKS. = ^lON OF 110 

TECHNICAL ♦ SERVICES. = sDINATION AND; INTEGRATION OF 111 

TECHNICAL * SERVICES. = CONTINUING EDUCATION IN 115 

TECHNICAL ♦ SYSTEMS IN CURRENT USE . = / sC lENTl FIC AND B3-2 

TECHNCLOGY.= sEVTEW: ANNUAL RE'-IEH OF * SCIENCE AND B3-2 

THESAURUS AND DUAL DICTIONARY. = ♦ COf*PUTER-COMFILED 124 

TRAINING PROGRAMS FCR * SCIENTISTS. = ♦ AND ACADEMIC 118 

VERIFICATICN OF STRUCTURAL * .= ♦RM. 1. STORAGE AND 43 

Figure 4.7 DKHIC index entries for the sam*^ high- 
density term of Figure 4.1 illustrating ordered 
access to all secondary concepts represented by 
significant words in the titles ^ 



Thus, the main term "INFORMATION SYSTEM" would appear in the
DKWIC index, gathering related subordinate terms and allowing
one to quickly coordinate other concepts as well.

INFORMATION SYSTEM

CHEMICAL * .= + NONUNIQUE NOTATION IN A LARGE-SCALE  192
EDITORIAL: A NATIONAL * .=  E 61
NATIONAL * .= EDITORIAL: A  E 61
NONUNIQUE NOTATION IN A LARGE-SCALE CHEMICAL * .= +  192
NOTATION IN A LARGE-SCALE CHEMICAL * .= + NONUNIQUE  192

Figure 4.8 Illustration of a two-word main term
which provides immediate access to more specific
concepts

There is no theoretical upper limit to the length of
multi-word main terms; however, a practical limit of three
or four words appears to be of sufficient magnitude to
generate most useful multi-word concepts. Figure 4.9
illustrates how concepts scattered in each of the indexing
schemes previously described are gathered under a useful
three-word main term, "TECHNICAL INFORMATION SERVICES".

TECHNICAL INFORMATION SERVICES

CONTINUING EDUCATION IN * .=  115
COORDINATION AND INTEGRATION OF * .=  111
EDUCATION IN * .= CONTINUING  115
INTEGRATION OF * .= COORDINATION AND  111

Figure 4.9 A three-word main term of a DKWIC index

The use of enrichment terms to enhance the quality of
KWIC indexes applies even more so to DKWIC indexes. Two
enrichment terms were added to the titles used as examples
for the illustrations of this chapter - one for book reviews
and one for editorials. Figure 4.6 illustrates a portion of
the subordinate entries under the main term BOOK_REVIEW.
Note how the subordinate entries enable one immediately to
locate entries for those books whose titles contain keywords
of particular interest. Furthermore, as illustrated in
Figure 4.7, access can be gained through the keywords of the
book titles themselves - e.g. "INFORMATION".

4.3. Stoplists for the Prototype Double-KWIC Coordinate Index

Three stoplists were used to preclude the appearance of
nonsignificant main terms and subordinate terms in this
prototype double-KWIC coordinate index.



The Potential Main Term Stoplist consists of low index-
value words which should never appear as the first word of a
main index term, but which might appear in other positions
of a main term. Included on this list are such words as
"activities", "announcement", "applications", "approach",
"assisted", etc.; all prepositions, articles, and
conjunctions; and all character strings less than three
characters long.

The Subordinate Term Stoplist consists of words which
should never appear as subordinate index terms or as the
final word of a multi-word main index term. Included on
this list are all prepositions, articles, and conjunctions;
all character strings less than three characters long; and
a few words of extraordinarily low index value, such as
"some", "such", etc.

These two stoplists were invoked by the algorithms
which generated the main term and subordinate term entries.
Consequently, these stoplists actually prevented generation
of index entries containing the stoplist words in the
positions indicated above.

The Actual Main Term Stoplist, on the other hand, was
invoked just prior to the output formatting stage. Its
function was to eliminate redundancy caused by generation of
single-word and multi-word main terms which started with a
common word (see section 5.3). For example, the main terms
"AMERICAN" and "AMERICAN CHEMICAL" were eliminated in favor
of the more specific term "AMERICAN CHEMICAL SOCIETY" since
there was complete overlap in the titles from which they
were derived. In other instances, the less specific term
may have been retained if there was incomplete overlap.
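
The redundancy rule just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
source does not say how the Actual Main Term Stoplist was compiled, so
the function name drop_redundant_terms, the dictionary layout (main
term mapped to the set of accession codes of the titles containing it),
and the use of an exact set comparison for "complete overlap" are all
hypothetical.

```python
def drop_redundant_terms(term_titles):
    """Drop a main term when a longer term that starts with it
    was derived from exactly the same set of titles.

    term_titles: dict mapping a main term (str) to the set of
    accession codes of the titles in which it occurs (assumed layout).
    """
    kept = {}
    for term, titles in term_titles.items():
        redundant = any(
            other.startswith(term + " ") and other_titles == titles
            for other, other_titles in term_titles.items()
        )
        if not redundant:
            kept[term] = titles
    return kept


# Hypothetical accession codes, chosen only to mirror the example above.
example = {
    "AMERICAN": {"17", "42"},
    "AMERICAN CHEMICAL": {"17", "42"},
    "AMERICAN CHEMICAL SOCIETY": {"17", "42"},
    "INFORMATION": {"61", "101", "192"},
    "INFORMATION SYSTEM": {"61", "192"},
}
kept_terms = drop_redundant_terms(example)
```

With this input the two less specific "AMERICAN" terms are dropped,
while "INFORMATION" is retained alongside "INFORMATION SYSTEM" because
the overlap of their titles is incomplete.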

4.4. Advantages and Disadvantages of the DKWIC Indexing
Techniques

Some of the advantages of the double-KWIC coordinate
indexing technique as compared to the KWIC indexing
technique and its variants have already been cited.
Briefly, they may be summarized as follows:

1) The double-KWIC technique provides a greater depth
of indexing.

2) Coordinate searches can be performed more easily on
double-KWIC coordinate index entries, both because of
the format and because of the alphabetic ordering of
the subordinate terms under the main index terms.
False coordinations, such as those that occur in the
PERMUTERM index (sections 3.1.4 and 3.3), are unlikely
because contextual relationships between the main terms
and the subordinate terms are preserved in each index
entry.

3) Class relationships can be expressed by use of
enrichment terms. When these enrichment terms appear
as main headings, the members of the class are
differentiated on the basis of the subordinate index
terms. Specific members of a class can also be
accessed through main headings describing the specific
members of the class.

4) The format of the double-KWIC coordinate index entry
is more readable because it closely resembles the
format of a conventional subject index entry.

The major disadvantages of the double-KWIC coordinate
indexing technique relative to the conventional KWIC indexing
technique are the increased index size and the higher costs
of index production. For example, the prototype index
occupied approximately four times the space required by a
comparable KWIC index. Despite such an increased size
relative to the conventional KWIC index, cost-return
benefits could well justify the use of DKWIC indexes in
place of conventional KWIC indexes in many places.

The real value of the double-KWIC coordinate indexing
technique can be appreciated when it is compared with the
automated articulated subject indexing technique for
generating index entries from a given set of titles or
title-like phrases. The DKWIC entries approach the quality
of articulated entries but, because of their ease of
construction, which lacks extensive linguistic analysis, they
could be produced at considerably reduced cost.

4.5. Prototype System Design

The examples of double-KWIC coordinate indexes
displayed in Figures 4.7 through 4.9 are portions of the
prototype index automatically generated by the first
programming procedures developed to produce double-KWIC
coordinate indexes. The system designed to create this
prototype double-KWIC coordinate index was as follows (see
Figure 4.10). The first phase required generation of KWIC
index records from the source titles. Since all of the
words appearing in the index column of the conventional KWIC
index would become candidates for potential main terms in
the double-KWIC coordinate index, the main term stoplist was
invoked in the KWIC index step to preclude creation of index
entries for all words of low index value which were not to
appear as the first word of a main index term in the DKWIC
index. Potential main terms for the DKWIC index were
generated in the second step by extracting individual
keywords or phrases (word strings) from the index column of
the KWIC index. After each potential main term was
extracted, the remaining portion of the title was rotated so
as to create permuted subordinate entries. The subordinate
stoplist, consisting primarily of articles, prepositions, and
conjunctions, precluded generation of subordinate entries
beginning with words appearing in this stoplist.

The algorithm for generating potential main terms
(PMTs) defined a word as a string of characters bounded by
spaces. A PMT could consist of a single word or a set of
contiguous words up to some specified upper limit. If a
punctuation mark occurred at the end of any word, it was
removed during creation of a potential main term. Also,
word strings for which the last word of the string was on
the subordinate stoplist were not generated as potential
main terms.

Figure 4.10 System design for creating the
prototype DKWIC index
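
The rotation step of the prototype can be sketched as follows. This is
a minimal illustration rather than the original program: the function
name permuted_entries, the ".=" end-of-title marker placement, and the
stripping of trailing punctuation from every word are assumptions made
for the example. The main term is replaced by an asterisk, and the
remainder of the title is rotated so that each word not on the
subordinate stoplist leads one entry.

```python
def permuted_entries(title, main_term, sub_stop):
    """Return the sorted permuted subordinate entries for one title
    under one main term (a sketch of the prototype's rotation step)."""
    # Words are bounded by spaces; trailing punctuation is dropped
    # here for simplicity (an assumption of this sketch).
    words = [w.strip(".,;:") for w in title.upper().split()]
    term = main_term.upper().split()

    # Locate the first occurrence of the main term as a contiguous run.
    for start in range(len(words) - len(term) + 1):
        if words[start:start + len(term)] == term:
            break
    else:
        return []  # main term does not occur in this title

    # Replace the main term by the asterisk placeholder.
    marked = words[:start] + ["*"] + words[start + len(term):]

    entries = []
    for k, word in enumerate(marked):
        # Entries may not begin with the placeholder, with a string
        # shorter than three characters, or with a stoplisted word.
        if word == "*" or len(word) < 3 or word in sub_stop:
            continue
        tail = " ".join(marked[k:])   # from the leading word to title end
        head = " ".join(marked[:k])   # wrapped-around start of the title
        entries.append(f"{tail}.= {head}".rstrip())
    return sorted(entries)


stop = {"AND", "FOR", "THE"}
services = permuted_entries(
    "Continuing education in technical information services",
    "technical information services",
    stop,
)
```

For this title the sketch yields the two entries led by CONTINUING and
EDUCATION, which correspond to the entries for accession code 115 in
Figure 4.9.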

The index records generated by the above procedures
were sorted first on the basis of potential main terms and
then on the basis of the words in the subordinate entry.
From this sorted file, a printed list of all potential main
terms generated by the procedure was obtained, so that the
indexer could choose the actual main terms which would
appear in the final index (see section 5.3 and Figure 5.6
for further explanation of this process). These selections
were made during the final print phase via sequence numbers
assigned to the potential main terms in the printed list.
Selection of PMTs by sequence number rather than by stoplist
(see section 4.3) proved simpler, since, on the average,
fewer PMTs were selected than were rejected.



CHAPTER V. EVALUATION AND MODIFICATION OF THE PROTOTYPE
SYSTEM: THE KWOC-DKWIC HYBRID INDEX

The first application of the prototype double-KWIC
coordinate index algorithm provided a model to illustrate
the potential advantages of this new automatic indexing
technique {Petrarca, 68a}. Portions of this first index are
displayed and discussed in Chapter 4. The construction of
this and other indexes also provided opportunities for
evaluation of the prototype method and suggested a number of
ways in which the model could be refined. One immediately
obvious refinement pertained to the often encountered
situation illustrated in Figure 5.1 where the permuted
subordinate terms under the main term were all derived from
the same title. Obviously, there is little justification
for ballooning the size of the index by permuting
subordinate entries in situations like this.

LIBRARY OF CONGRESS

APPLICATIONS IN THE * SCIENCE AND TECHNOLOGY COMPUTER  63
COMPUTER APPLICATIONS IN THE * SCIENCE AND TECHNOLOG+  63
DIVISION.= +LICATIONS IN THE * SCIENCE AND TECHNOLOGY  63
SCIENCE AND TECHNOLOGY DIVISION.= +LICATIONS IN THE *  63
TECHNOLOGY DIVISION.= +LICATIONS IN THE * SCIENCE AND  63

LINGUISTIC ANALYSIS

INFORMATION RETRIEVAL.= LINGUIA: A * SYSTEM FOR  207
LINGUIA: A * SYSTEM FOR INFORMATION RETRIEVAL.=  207
RETRIEVAL.= LINGUIA: A * SYSTEM FOR INFORMATION  207
SYSTEM FOR INFORMATION RETRIEVAL.= LINGUIA: A *  207

Figure 5.1 Size-ballooning effect in the prototype
DKWIC index caused by permuting subordinate
entries under main terms derived from only a
single title

Another observation is illustrated in Figure 5.2 for those
cases where an indexable word or phrase occurs more than
once in a title. The title from which these particular
entries were created contained the word "INDEX" twice. For
each occurrence, it was extracted as if it were a different
main term. Subsequent rotations of the remaining words in
the title produced a stuttering effect through pairs of
nearly identical subordinate entries in the resulting index.

Figure 5.2 Stuttering effect and size-ballooning
effect in the prototype DKWIC index caused by
permuted subordinate entries for a main term which
appears more than once in a title

Observations such as those just described led to
reexamination of the approach used to construct
the prototype model.

The above problems obviously resulted from too close
adherence to the principles of KWIC index construction.
Once a potential main term was extracted from a title, the
remaining portion of the title was always permuted
regardless of whether the potential main term occurred more
than once in a given title, or whether it occurred only once
in the entire set of titles being indexed. Fully rotated
subordinate entries were constructed for all potential main
terms whether or not they were selected for inclusion in the
final index. This indiscriminate approach to permuted
subordinate entry construction not only created the problems
mentioned before (Figure 5.1 and Figure 5.2), but also
needlessly increased the cost of constructing the index.
Although some of these problems had been anticipated
beforehand, it was decided to generate all second order
permutations of the titles for the prototype index on the
premise that word and phrase patterns generated by these
permutations might provide some insight into the problems of
main term and subordinate term selection.

5.1. The Modified System: [...] of KWOC-DKWIC
Hybrid Indexes

To overcome many of the problems encountered in the
prototype model, a slightly different approach for
construction of the double-KWIC coordinate index {Lay, 70}
was used which produces a KWOC-DKWIC hybrid index. The
basic differences between the prototype and the modified
approach are:

1) The potential main terms are now extracted directly
from the titles (or title-like phrases) instead of from
a KWIC index of the titles, and the potential index
entries so created are temporarily retained in a KWOC-
type format until other conditions are examined;

2) After all of the titles have been processed and the
actual main terms have been selected, if the number of
titles containing a particular main term exceeds an
arbitrarily assigned threshold value, conventional
double-KWIC (permuted) entries are created; if the
threshold value is not exceeded, KWOC-type (non-
permuted) subordinate entries are created.

The above processes are illustrated conceptually in Figure
5.3 while the system design chart for the data flow in the
KWOC-DKWIC approach is illustrated in Figure 5.4. The new
design consists of two phases, each terminating with an
alphabetic sort of records produced by that step. The first
phase generates all potential main terms from the titles
being indexed. The second phase is directed towards
selection of actual main terms and creation of permuted
subordinate entries which are to appear in the final index.
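
The threshold rule in item 2 can be expressed compactly. The following
Python sketch is illustrative only; the names entry_style and
plan_index and the dictionary of title counts are assumptions
introduced for the example, not part of the system described. It
decides, for each selected main term, whether permuted (double-KWIC)
or non-permuted (KWOC-type) subordinate entries would be created.

```python
def entry_style(title_count, threshold):
    """Permuted entries are created only when the number of titles
    containing the main term exceeds the threshold value."""
    return "permuted" if title_count > threshold else "kwoc"


def plan_index(title_counts, selected_terms, threshold):
    """Map each selected main term to the style of its subordinate
    entries. title_counts: dict of main term -> number of titles."""
    return {
        term: entry_style(title_counts[term], threshold)
        for term in selected_terms
    }


# Counts for the two titles used in Figure 5.3.
counts = {"INDEXING": 2, "DOCUMENT REPRESENTATION": 1}
selected = ["INDEXING", "DOCUMENT REPRESENTATION"]

hybrid = plan_index(counts, selected, threshold=1)
all_permuted = plan_index(counts, selected, threshold=0)
all_kwoc = plan_index(counts, selected, threshold=10**9)
```

With a threshold of one, INDEXING (two titles) receives permuted
entries while DOCUMENT REPRESENTATION (one title) receives a single
KWOC-type entry; a threshold of zero permutes everything, and a very
large threshold produces KWOC-type entries throughout.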

5.2. Extraction of Potential Main Terms (PMTs)

The first phase consists of the extraction of all
potential main terms from the titles being indexed, word
significance still being based on appropriate stoplists.

Titles

1. COMPARING INDEXING EFFICIENCY AND CONSISTENCY.= 25
2. DOCUMENT REPRESENTATION AND INDEXING CONSISTENCY.= 56

Step 1: Potential Index Entries

DOCUMENT // * REPRESENTATION AND INDEXING CONSISTENCY.= 56
INDEXING // COMPARING * EFFICIENCY AND CONSISTENCY.= 25
INDEXING // DOCUMENT REPRESENTATION AND * CONSISTENCY.= 56
INDEXING EFFICIENCY // COMPARING * AND CONSISTENCY.= 25

Step 2: Actual Index Entries

INDEXING

COMPARING * EFFICIENCY AND CONSISTENCY.=  25
CONSISTENCY.= COMPARING * EFFICIENCY AND  25
* CONSISTENCY.= DOCUMENT REPRESENTATION AND  56
DOCUMENT REPRESENTATION AND * CONSISTENCY.=  56
* EFFICIENCY AND CONSISTENCY.= COMPARING  25
REPRESENTATION AND * CONSISTENCY.= DOCUMENT  56

DOCUMENT REPRESENTATION

* AND INDEXING CONSISTENCY.=  56

1 - Potential main term extracted directly from title
2 - KWOC-type subordinate entry stored temporarily with PMT
3 - Actual main term selected from potential main terms
4 - Permuted subordinate entries created from KWOC-type
entries
5 - Non-permuted subordinate entry
6 - Location of the extracted main term
7 - Accession code

Figure 5.3 Annotated description of the
construction of index terms for the KWOC-DKWIC
hybrid index

Figure 5.4 System design for creating the KWOC-
DKWIC hybrid index

The algorithm for generating potential main terms was
modified to define a word as a string of characters bounded
by a set of delimiters. These delimiters are partitioned
into two classes, terminal and non-terminal, and the
function of each is described below in conjunction with
criteria used for construction of PMTs. For a clearer
understanding of these criteria the reader is referred to
Figure 5.5 which provides several examples illustrating
their application.

A potential main term may consist of a single word or a
set of contiguous words up to some specified upper limit
(indicated by a user input parameter); it must have the
following three attributes:

1) The first word of the potential main term must not
be on the main term stoplist or on the subordinate term
stoplist;

2) The last word of a candidate contiguous set must not
be on the subordinate stoplist;

3) All words in a candidate contiguous set must be
separated by non-terminal word delimiters.

The first and second attributes are the same as those
previously required in the KWIC indexing and potential main
term generation phases, respectively, of the prototype DKWIC
system. Because certain punctuation marks between words
usually signify introduction of a new concept, requirement
of the third attribute was introduced to assure generation
of potential main terms describing only a single concept.
Finally, the new approach identifies multiple occurrences of
a potential main term in any particular title being indexed;
hence, only unique potential main terms are generated from a
given title.
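
The three attributes can be sketched in Python as shown below. This is
a minimal illustration under stated assumptions, not the original
program: the function names, the upper limit of three words, and in
particular the treatment of the hyphen as a non-terminal delimiter
(inferred from the PMT "COMPUTER BASED" in Figure 5.5) are assumptions.
Strings shorter than three characters are treated as stoplisted, as
described in section 4.3.

```python
TERMINAL = set(".,;:?()!")   # terminal word delimiters
NON_TERMINAL = set(" -")     # assumed: blank and hyphen


def tokenize(title):
    """Split a title into words. breaks[i] is True when a terminal
    delimiter occurs after words[i] and before the next word."""
    words, breaks, current = [], [], ""
    for ch in title.upper():
        if ch in TERMINAL or ch in NON_TERMINAL:
            if current:
                words.append(current)
                breaks.append(False)
                current = ""
            if ch in TERMINAL and words:
                breaks[-1] = True
        else:
            current += ch
    if current:
        words.append(current)
        breaks.append(False)
    return words, breaks


def potential_main_terms(title, main_stop, sub_stop, max_words=3):
    """Generate the unique PMTs of a title in order of generation."""
    words, breaks = tokenize(title)

    def stopped(word, stoplist):
        return len(word) < 3 or word in stoplist

    seen, pmts = set(), []
    for i, first in enumerate(words):
        # Attribute 1: first word on neither stoplist.
        if stopped(first, main_stop) or stopped(first, sub_stop):
            continue
        for n in range(1, max_words + 1):
            j = i + n - 1
            if j >= len(words):
                break
            # Attribute 3: no terminal delimiter inside the candidate.
            if n > 1 and breaks[j - 1]:
                break
            # Attribute 2: last word not on the subordinate stoplist.
            if stopped(words[j], sub_stop):
                continue
            term = " ".join(words[i:j + 1])
            if term not in seen:          # only unique PMTs per title
                seen.add(term)
                pmts.append(term)
    return pmts


pmts = potential_main_terms(
    "DASAR: A COMPUTER-BASED DATA STORAGE AND DATA RETRIEVAL.",
    main_stop={"BASED"},
    sub_stop={"AND"},
)
```

Applied to the example title of Figure 5.5, the sketch generates DATA
only once although the word occurs twice in the title. Under these
assumptions it also generates DATA RETRIEVAL, which the figure as
reproduced here does not list.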

Figure 5.5 illustrates the potential main terms that
would be generated from a title on the basis of the above
criteria. For one of the potential index entries
illustrated therein, note the treatment for multiple
occurrences of a main term "DATA" in a title. This
treatment precluded the possibility of generating groups of
nearly identical subordinate entries to produce the
stuttering effect encountered in the initial model
(Figure 5.2).

Word Delimiters

Terminal ".,;:?()!"
Non-terminal "

Title

DASAR: A COMPUTER-BASED DATA STORAGE AND DATA RETRIEVAL.= 62

Potential Main Terms

DASAR
COMPUTER
COMPUTER BASED
COMPUTER BASED DATA
DATA
DATA STORAGE
STORAGE
STORAGE AND DATA
RETRIEVAL

Some Potential Index Entries

DATA // DASAR: A COMPUTER-BASED * STORAGE AND *
RETRIEVAL.= 62
DATA STORAGE // DASAR: A COMPUTER-BASED * AND DATA
RETRIEVAL.= 62

Figure 5.5 Illustration of the effect of word
delimiters and selection criteria on generation of
potential main terms and potential index entries
from a title. The PMTs are sequenced on the basis
of the order in which they would be generated from
each significant starting word in the title. The
word "BASED" appeared on the primary stoplist and
"AND" is on the secondary stoplist.

5.3. Human Interface Requirements for the Selection of
Actual Main Terms and KWOC-DKWIC Threshold
Values

After all the titles from a given source have been
processed and the potential index entries have been sorted,
a printed list of all potential main terms, referenced by
sequence numbers, is prepared. This list also includes
frequency data for the number of titles in which that
particular main term occurs. Figure 5.6 illustrates some
potential main term listings from a particular production
run.

Figure 5.6 A portion of a PMT list and occurrence
frequency data used for selection of actual main
terms

At this point, a human interface step is required for
selection of the actual main terms which are to appear in
the KWOC-DKWIC index. The sequence numbers for the desired
main terms (e.g. #2^ and #33 in Figure 5.6) and the
threshold value for controlling the relative number of
permuted and non-permuted subordinate entries are supplied
as input parameters to the next processing step. The actual
entries for the index are then sorted and printed in
accordance with any previously supplied display
specifications (see Figure 5.7).



INDEX

THE DOUBLE-KWIC COORDINATE * II. USE OF AN
AUTOMATICALLY GENERATED AUTHORITY LIST TO ELIMINATE
SCATTERING CAUSED BY SOME SINGULAR AND PLURAL MAIN
TERMS.=  277

INDEXING

COMPARING * EFFICIENCY AND CONSISTENCY.=  324
CONSISTENCY.= COMPARING * EFFICIENCY AND  324
* CONSISTENCY.= DOCUMENT REPRESENTATION AND  238
DOCUMENT REPRESENTATION AND * CONSISTENCY.=  324
* EFFICIENCY AND CONSISTENCY.= COMPARING  324
REPRESENTATION AND * CONSISTENCY.= DOCUMENT  238

Figure 5.7 Example of two types of subordinate
entries found in a KWOC-DKWIC hybrid index



5.4. Other Features of the KWOC-DKWIC Hybrid System

An additional display feature for permuted subordinate
entries under the new approach (Figure 5.7) enables one to
more easily identify certain proximity relationships (and
hence, semantic relationships) between main terms and
subordinate terms. This is accomplished by displaying the
replacement asterisk for the main term in the left hand
margin of the subordinate entry when the main term
immediately precedes the first word of the wrap-around
entry.

The new systems design enables one to produce a range
of index types which vary in size, quality (i.e., degree of
KWOCness and DKWICness), and cost. This can be accomplished
simply by varying the threshold value which controls the
relative number of permuted and non-permuted subordinate
entries. For example, by using a threshold value of zero one
can produce an index in which all of the subordinate entries
are permuted, as was exemplified by the prototype index. On
the other hand, by using an extraordinarily high threshold
value one can produce a straight KWOC index. In between
these two extremes, indexes with varying combinations of
both types of entries can be generated. By using KWOC-type
subordinate entries a considerable reduction in the size of
the printed index results. But, as the number of KWOC-type
subordinate entries under a main term is increased, the
accessibility to subordinate concepts described therein is
significantly reduced and the advantages of double-KWIC
coordinate indexing are lost. However, if one is willing to
concede that accessibility to subordinate concepts is not
significantly reduced when the number of KWOC-type
subordinate entries is small, one can achieve a significant
reduction in the overall size and cost of the printed index
by using a low threshold value for controlling the
generation of permuted and non-permuted subordinate entries.
For example, the size of the prototype index (section 4.4)
was reduced by 40% simply by using a threshold value of one.

CHAPTER VI. VOCABULARY CONTROL FOR NATURAL LANGUAGE
INDEXING

Proponents defend derivative indexing techniques not
only because of the relative speed and ease of indexing
large quantities of documents, but also because of the
novelty and currency of the vocabulary used to construct the
index entries themselves. Kennedy {Kennedy, 63} claims "the
use of the author's own terms - the alive currency of new
ideas - rather than the considered reshapings to the
indexing system may often be of great advantage." New
concepts described by new words or new uses of words would
rightfully find their place in the derivative indexes
described earlier. Traditional indexing techniques would be
forced to map these new concepts into previously established
categories, masking much of their usefulness. Several of the
indexes discussed, notably KWIC, which contain the context
about a keyword or phrase, present the user with a
"suggestiveness" concerning other concepts or relations
which exist in the remaining phrase. From these
correlations the user may be led to other equally relevant
access points in the index.

This very vocabulary freedom has also been cited as a
common complaint of derivative techniques. The methods
described operate on words with an equivalence relation
based solely upon the character makeup of the words.
Synonyms, homonyms, eponyms, and neologisms cannot be
resolved by machine without further in-depth analysis of the
text presented for indexing. The machine's inability to
resolve these language redundancies results in the scattering
of index terms for a given topic throughout the index with
the danger of possible retrieval loss by the user since he
must anticipate each author's word usage.

The types of scattering occurring in derivative indexes
can be classified according to the construct causing the
scattering. Inflectional scattering is the result of words
having the same prefix and word stem, but differing in
inflectional ending. The words automate, automates,
automated, automatic, automatically, and automation all
refer to similar concepts yet may be scattered in the index
because of terminal spelling differences. More serious
problems occur in synonymal scattering: synonyms or near-
synonyms which become separated in the index due to stem
spelling differences.

The scattering in free vocabulary indexes can be
reduced efficiently in two phases. For each access word in
the index, first delete all causes of inflectional
scattering, then, having retrieved the word stem, resolve
any synonymal index scattering. The next two sections of
this chapter present methods for reduction of index
scattering.

The constituent words of an index descriptor are
composed of an informative stem prefixed to variant
character strings which merely enable this information to be
expressed in grammatical form. When these stem suffixes
participate as part of the collating sequence for ordering
index descriptors in a printed index, inflectional
scattering occurs as illustrated by some KWIC index entries
from an issue of Chemical Titles containing the entries RAT
and RATS separated by several pages of unrelated titles
(Figure 6.1). Consequently, inflectional scattering can be
resolved by identifying and eliminating grammatical endings
of words participating in the index collation.

 OF CONSTANT ABSORPTION RATES.= +TRANSFER UNDER CONDITIONS
ODIUM CHLORIDE IN GRAIN RATION ON THE FREEZING POINT OF MILK
ADRENALINE SYNTHESIS IN RATS AFTER RESERPINE TREATMENT.= +OR
EN SULFATE FORMATION IN RATS AND MICE.= +IN ANDROG

Figure 6.1 Inflectional scattering in a KWIC index

The consequences of inflectional scattering are equally
apparent in the double-KWIC coordinate indexing technique.
The main index terms are derived strictly on the basis of
words which actually appear in the titles processed. This
causes some scattering of information when two or more main
terms contain the same word root but different inflectional
endings. A portion of the prototype index where such
scattering was observed because of the occurrence of
singular and plural word forms is illustrated in Figure 6.2.

NATIONAL * .= .... EDITORIAL: A  E 61
NONUNIQUE NOTATION IN A LARGE-SCALE CHEMICAL * .= +A  192
USE OF A NONUNIQUE NOTATION IN A LARGE-SCALE CHEMIC+  192

INFORMATION SYSTEMS

BOOK_REVIEW: NONCONVENTIONAL SCIENTIFIC AND TECHNIC+  B3-2
COSTS OF * .= DETERMINING  101
CURRENT USE.= +NTIONAL SCIENTIFIC AND TECHNICAL * IN  B3-2
DETERMINING COSTS OF * .=  101
NONCONVENTIONAL SCIENTIFIC AND TECHNICAL * IN CURRE+  B3-2
SCIENTIFIC AND TECHNICAL * IN CURRENT USE.= +NTIONAL  B3-2
TECHNICAL * IN CURRENT USE.= +NTIONAL SCIENTIFIC AND  B3-2
USE.= +NTIONAL SCIENTIFIC AND TECHNICAL * IN CURRENT  B3-2

Figure 6.2 A portion of the prototype DKWIC index
illustrating scattering due to the occurrence of
singular and plural word forms

f. ^ _ -- . _____ 

Inflectional scattering can be remedied by a stemming
algorithm, which is a computational procedure to reduce all
words with the same root to a common form, usually by
stripping each word of its derivational and inflectional
suffixes. A standard approach to stemming algorithms
retrieves the stem of a word by removing an ending which
matches a list of stored suffixes. Two main principles
direct the matching of word endings: iteration and longest
match.

An iterative algorithm is, as its name implies, a
repeated removal of character strings affixed to a word.
Lejnieks {Lejnieks,67} observed that suffixes are attached
to word stems in a certain order, that is, there exist
order-classes of suffixes. A match is sought with an ending
in the terminal order-class (that order-class containing
suffixes which are found at the end of words), the ending is
removed, and the process repeated with the next order-class
until no more matches are found. A strictly iterative
technique may require many order-classes whose members may
be difficult to ascertain.

The longest-match principle requires a single
order-class. If more than one ending from this order-class
matches a word suffix, the longest is removed. This
principle is easily implemented by scanning the endings in
order of decreasing length. Longest-match algorithms entail
the generation of all possible combinations of affixes, which
requires a much higher storage overhead than the shorter
lists of iterative approaches.
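
The scan in order of decreasing length can be sketched in a
few lines of Python. This is only an illustration of the
longest-match principle: the ending list and the function name
strip_longest_ending are hypothetical and are not taken from
any of the algorithms cited here.

```python
# Hypothetical ending list; a working stemmer would store many more endings.
ENDINGS = ["ational", "ization", "ations", "ation", "ings", "ing", "es", "s"]


def strip_longest_ending(word, endings=ENDINGS):
    """Remove the longest stored ending that matches the end of the word."""
    # Scanning longest-first guarantees the first hit is the longest match.
    for ending in sorted(endings, key=len, reverse=True):
        if len(word) > len(ending) and word.endswith(ending):
            return word[: -len(ending)]
    return word  # no stored ending matched


print(strip_longest_ending("indexings"))      # longest match is "ings"
print(strip_longest_ending("normalization"))  # longest match is "ization"
```

No context-sensitive conditions are applied in this sketch, so
a word such as "rates" is cut back to "rat".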



A suffix match may not always be a sufficient condition
for ending removal with either algorithm. Qualitative and
quantitative context-sensitive conditions associated with a
particular suffix may be necessary to limit the
applicability of suffix deletion. The "context" refers,
qualitatively, to the type of characters and,
quantitatively, to the number of characters of the remaining
stem if the ending is removed.

Tukey {Tukey,68} has proposed a context-sensitive,
partially iterative, multilingual stemming algorithm whose
endings are divided into four order-classes. It is
structurally complex, requiring distinct matching procedures
for each order-class and context-sensitive case.

Salton {Salton,68b} and Lesk {Lesk,66} have described a
stem and ending, longest-match, dictionary approach. The
stem is sought by matching a complete entry from a stem
dictionary with the first k characters of the word. The
suffix, beginning with the k+1st character, must appear in
an ending dictionary before the stem-ending pair is
accepted. The single context-sensitive condition of stem
dictionary match can be easily handled by program, but the
required dictionaries severely limit the algorithm's
generality.

Lovins {Lovins,68} combines the iterative and
longest-match techniques to good advantage and, with the
addition of a context-sensitive recoding algorithm, cures
many spelling exceptions which occur when some suffixes are
attached to words.
6.1.1. Stemming and Recoding for Printed Indexes

The stemming techniques cited above are concerned with
the algorithmic retrieval of word stems regardless of their
form. The user of a printed index, unfamiliar with
retrieval by stems, may be somewhat confused by descriptors
composed of word stems. Consequently, at least for printed
indexes, the stem must be recoded to form a word
recognizable by the user. Words having similar stems must
be similarly recoded to avoid interjecting secondary
scattering.

Two possible approaches to recoding stems seem 
available: 

1) using the stem, enter a dictionary and retrieve the
preferred suffix - the reverse of Salton's technique
for stemming, or,

2) the ending itself may be associated with a preferred
suffix substitute.

The latter seems most appropriate because of its general 
applicability and lack of sizable stem dictionaries. 

To attack the problems of stemming and recoding for
printed indexes, a small subset of the possible inflectional
endings was chosen for experimental study. Title phrases
generally abound with nouns and nominal phrases. A high
percentage of inflectional scattering in printed title
indexes results from the occurrence of the same nominal stem
in singular and plural forms. The stemming-recoding
technique to be described is presently limited to plural
forms ending in "s"; however, the technique may be expanded
readily to other inflectional endings.

6.1.2. Plural-Singular Stemming-Recoding Algorithm

An initial solution to inflectional scattering
automatically generates singular words from plurals ending
in "s" {Petrarca,68b}. A word transformation routine,
constructed empirically from the examination of the stemming
algorithms mentioned above and a reverse English dictionary
{Brown,63}, acts on words ending in "s" and performs two
functions: 1) decides whether the word is transformable
(i.e. is a plural of a singular concept); and, 2) if the
word is transformable, generates the singular form.

The algorithm identifies the transformability of a word
by examining only a few characters preceding the final "s"
and derives the singular either algorithmically or by
consultation of an appropriate exception list. The
description of the algorithm, given below, is divided into
three parts, each describing the action taken based on the
number of letters previously scanned. The prescription for
forming the singular concept is given at each point where a
transformable decision can be made.

Second to the last character is:

1) "s","u"
   the word is not transformable (e.g. stress, thesaurus,
   etc.).

2) "a","o"
   an exception list is examined for nontransformable
   words (e.g. atlas, pathos, etc.). If the word is not
   found, the final "s" is dropped (e.g. spatulas, zeros,
   etc.).

3) "i"
   if the third to the last character is "s", the word is
   not transformable (e.g. analysis, thesis, etc.);
   otherwise, the exception list mentioned in case 2 is
   examined for nontransformable words (e.g. this, etc.).
   If not present, the final "s" is dropped
   (e.g. martinis, etc.).

4) "'"
   the singular, non-possessive word is formed by dropping
   the "'s".

5) "e"
   the third to the last character must be examined before
   a decision can be made (next section).

6) "all other letters"
   an exception list is examined for nontransformable
   words ending in "consonant s" (e.g. physics, MEDLARS,
   etc.). If the word is not found, the final "s" is
   dropped (e.g. appears, admits, etc.).

Third to the last character is:

7) "e","u"
   the singular word is formed by dropping the final "s"
   (e.g. trees, clues, etc.).

8) "h"
   the singular is formed by dropping the final "es"
   (e.g. searches, etc.).

9) "v"
   if the fourth to the last character is "l", the "v" is
   changed to "f" and the "es" is dropped to form the
   singular (e.g. halves, etc.); otherwise, the process is
   the same as in step 12.

10) "i"
   an exception list containing nontransformable words
   ending in "ies" (e.g. series, etc.) is consulted. If
   the word is not found, the singular is formed by
   dropping the "ies" and adding "y" (e.g. activities,
   etc.).

11) "s"
   the fourth to the last character must be examined
   before a decision can be made (next section).

12) "all other letters"
   an exception list is consulted for irregularly formed
   singulars whose plurals end in "es" (e.g. indices,
   etc.). If the word is a member of this list, the
   singular is returned from an exceptions dictionary. If
   not present, the singular is formed by dropping the
   final "es" (e.g. zeroes, etc.).

The fourth to the last character is:

13) "e","y"
   the singular is formed by dropping the final "es" and
   adding "is" (e.g. theses, analyses, etc.).

14) "s"
   the word is transformable, but an exception list is
   examined for those plurals whose singulars are formed
   by dropping the final "ses" (e.g. busses, etc.). Words
   not on this list are transformed by dropping the final
   "es" (e.g. stresses, masses, etc.).

15) "all other letters"
   the word is transformable, but an exception list is
   consulted for those words ending in "ses" for which
   singulars are formed by dropping the final "es"
   (e.g. thesauruses, choruses, etc.); otherwise, the
   singular is formed by dropping the final "s"
   (e.g. cases, uses, etc.).
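
The fifteen cases above can be condensed into a short Python
sketch. It is not the PL/I routine itself: the exception lists
hold only the example words quoted in the text, the names
KEEP_AS_IS, IRREGULAR, DROP_SES, DROP_ES and singularize are
hypothetical, and the minimum word length of four letters is an
assumption added to keep the indexing safe.

```python
# Illustrative exception lists built from the examples quoted above.
KEEP_AS_IS = {"atlas", "pathos", "this", "physics", "medlars", "series"}
IRREGULAR = {"indices": "index"}        # case 12: singular is looked up
DROP_SES = {"busses"}                   # case 14: drop the final "ses"
DROP_ES = {"thesauruses", "choruses"}   # case 15: drop the final "es"


def singularize(word):
    """Return the singular (lower case), or None if not transformable."""
    w = word.lower()
    if len(w) < 4 or not w.endswith("s"):   # length limit is an assumption
        return None
    second, third, fourth = w[-2], w[-3], w[-4]
    if second in ("s", "u"):                # case 1
        return None
    if second == "'":                       # case 4
        return w[:-2]
    if second != "e":                       # cases 2, 3 and 6
        if second == "i" and third == "s":  # case 3: analysis, thesis
            return None
        return None if w in KEEP_AS_IS else w[:-1]
    # Second to last is "e" (case 5): decide on the third to last.
    if third in ("e", "u"):                 # case 7
        return w[:-1]
    if third == "h":                        # case 8
        return w[:-2]
    if third == "v" and fourth == "l":      # case 9
        return w[:-3] + "f"
    if third == "i":                        # case 10
        return None if w in KEEP_AS_IS else w[:-3] + "y"
    if third == "s":                        # case 11: fourth to last
        if fourth in ("e", "y"):            # case 13
            return w[:-2] + "is"
        if fourth == "s":                   # case 14
            return w[:-3] if w in DROP_SES else w[:-2]
        return w[:-2] if w in DROP_ES else w[:-1]   # case 15
    return IRREGULAR.get(w, w[:-2])         # case 12


for plural in ("activities", "halves", "theses", "cases", "stress"):
    print(plural, "->", singularize(plural))
```

As in the description, a "v" not preceded by "l" falls through
to case 12, so the quality of the result depends on how
complete the exception lists are.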

The algorithm has performed well on a large number of
data bases, requiring exceptionally short exception lists.
The lists were cumulatively gathered after processing
several large title data bases. Our experience has shown
that the word transformation routine, coded in PL/I for an
IBM 360 Model 75, successfully singularizes all plurals
ending in "s" at a rate of 50 per second when applied to a
title data base containing 5% transformable plural words.

The resulting plural words and their recoded singulars
can be used to gather these similar concepts under a single
access point in an index by several means: 1) alter the data
base being indexed by replacing all transformable plurals
with their respective singulars, or 2) with a "preferred
word", replace the occurrence of both the singular and
plural forms of transformable plurals. The first
alternative can be easily implemented as part of the word
transformation routine, altering the data base as a
transformable plural is found. For generalized
stem-recoding algorithms, however, this practice may lead a
user astray through the omission of grammatical information.
With a properly chosen "preferred word" giving some clue to
the original grammatical construction, a user can generally
reconstruct the appropriate suffix.

Following this second approach, the word transformation
routine creates an authority list consisting of a "preferred
word" for each plural-singular word pair found in the data
base. The "preferred word" is a non-specific entity which
consists of the singular word followed by the plural ending
enclosed in parentheses. Figure 6.3 depicts a portion of an
authority list produced by the word transformation routine.

Figure 6.3 A portion of an automatically generated
authority list produced by the plural-singular
stemming-recoding algorithm

The authority list is utilized during index
construction (see Figure 4.5 and Figure 5.1) to eliminate
inflectional scattering. Each significant word in the title
or phrase being examined is checked against the list of
singular and plural words on the authority list. Whenever a
match occurs, the actual word appearing in the context is
replaced by the preferred non-specific index word located in
the authority list. The grammatical information recorded in
the suffix is not altered if the word appears in some
functional location other than a potential main term.
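
A minimal Python sketch of the authority list follows. The
rule used to split off the plural ending (everything after the
letters shared by singular and plural) is an assumption, since
the layout of Figure 6.3 is not reproduced here; the function
names are likewise hypothetical.

```python
import os


def preferred_word(singular, plural):
    """Singular followed by the plural ending in parentheses."""
    # Assumption: the "plural ending" is whatever follows the shared prefix.
    shared = len(os.path.commonprefix([singular, plural]))
    return f"{singular}({plural[shared:]})"


def build_authority_list(pairs):
    """Map each singular and each plural form to its preferred word."""
    authority = {}
    for singular, plural in pairs:
        preferred = preferred_word(singular, plural)
        authority[singular] = preferred
        authority[plural] = preferred
    return authority


def normalize_main_term(words, authority):
    """Replace main-term words found on the authority list; keep the rest."""
    return [authority.get(word, word) for word in words]


authority = build_authority_list([("system", "systems"),
                                  ("activity", "activities")])
print(normalize_main_term(["information", "systems"], authority))
print(normalize_main_term(["information", "system"], authority))
```

Both calls yield the same main term, so entries that differed
only in number collate under one access point. Only the words
of a potential main term are passed through the list; the
surrounding context is left as written.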
Figure 6.4 Reduced scattering in a DKWIC index as
a result of applying an automatically generated
authority list to words of main terms (compare
Figure 6.2)

The results obtained using such an authority list
during the creation of a double-KWIC coordinate index are
illustrated in Figure 6.4 where the entries which were
scattered in the prototype index (see Figure 6.2) are now
merged under a single non-specific main term.
6.2. Synonymal Scattering

To the indexing specialist, the thesaurus has long been
a useful device. Primarily constructed for vocabulary
normalization, the thesaurus is a prescriptive indexing aid
which provides a single preferred word-form for synonyms and
near-synonyms, and for words occurring in various
inflections if inflectional scattering has not been
resolved.

Since machines are very adept at matching words,
synonymal scattering is easily eliminated by automating the
thesaurus lookup procedure. Artandi {Artandi,68} has
outlined a well-formed procedure of automatic vocabulary
normalization for book indexing. Once a keyword has been
identified, it is subject to a matching operation in the
thesaurus. A match signals the replacement of the original
keyword with the preferred word supplied by the thesaurus.

Artandi's approach, applied to natural language
indexing, though normalizing the vocabulary and thus
reducing synonymal scattering, reshapes the index into
predetermined categories. Any connotation or suggestiveness
supplied by the replaced word has been lost. A complete
change in meaning could possibly result if single words are
replaced by synonyms in title phrases. A KWIC index of
such phrases would inadvertently lead a user astray.

Highcock {Highcock,68} has demonstrated the inclusion
of synonymal pointers within KWIC indexes in the form of
"see also" cross references. Any synonyms are included
manually as part of the data base being indexed. The KWIC
indexing algorithm appropriately selects all the keywords in
the "see also" cross reference, placing them within the
collections of like terms (see Figure 6.5).

LASERS AND LASER MATERIALS. =
AND ADVANTAGES OF MAZZONI PROCESS. = +NG OF SOAPS. OUTLINE
HOT MELT ADHESIVES IN EUROPE. =
R. = HOT MELT APPLIES LAYS DOWN DOT OR SPRAY PATTE
PHILICITY ON TRANS MEMBRANE EFFLUX. = + OF INCREASING NUCLEO
+ATOMY OF THE CELL MEMBRANE. THE PHYSICAL STATE OF WATER IN+
IN SEE A+ SEE ALSO MEMBRANES SEE ALSO KERATIN SEE ALSO PROTE
IN SEE A+ SEE ALSO MEMBRANES SEE ALSO KERATIN SEE ALSO PROTE
+DUCTION OF POROUS MEMBRANES FOR BATTERIES AND FUEL CELLS. =
IPIDS OF BACTERIAL MEMBRANES. =
MERCURY SEE ALSO ELECTROCHEMISTRY. =
MERCURY SEE ALSO ELECTROCHEMISTRY.
RUCTURES OF LIQUID MERCURY AND LIQUID ALUMINIUM. = S
Y ADSORBED IONS ON MERCURY. = +SURFACE EXCESS OF SPECIFICALL
SORPTION IN SODIUM METABORATE SOLUTION. = ULTRASONIC AB
METAL SEE ALSO ORGANOMETALLIC. =
METAL SEE ALSO ORGANOMETALLIC. =
ANEOUS TOXICITY OF METAL COMPOUNDS. = PERCUT
POLAR MOLECULES ON METAL OXIDE SINGLE CRYSTALS. = +RPTION OF

Figure 6.5 Synonymal pointers found in a KWIC
index as "see also" cross references



Automated "see also" referencing combines the two
approaches mentioned above. As keywords are identified
during the indexing process, matches would be recorded with
thesaurus entries. The termination of the normal keyword
selection phase would signal an inspection of the thesaurus.
A "see also" reference would be generated for each term
whose related term also appeared in the data indexed.
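
The inspection step can be sketched as follows, assuming a
thesaurus held as a mapping from a term to its related terms.
The function name and the sample thesaurus entries are
hypothetical; only the rule that both terms must occur in the
data indexed comes from the description above.

```python
def see_also_references(keywords, thesaurus):
    """Return (term, related) pairs whose members both occur as keywords."""
    present = set(keywords)
    references = []
    for term in sorted(present):
        for related in sorted(thesaurus.get(term, ())):
            if related in present:   # pointer only if the target is indexed
                references.append((term, related))
    return references


# Hypothetical thesaurus entries, echoing the terms of Figure 6.5.
thesaurus = {
    "MERCURY": {"ELECTROCHEMISTRY"},
    "METAL": {"ORGANOMETALLIC", "ALLOY"},
}
keywords = ["METAL", "MERCURY", "ORGANOMETALLIC", "ELECTROCHEMISTRY"]
for term, related in see_also_references(keywords, thesaurus):
    print(f"{term} SEE ALSO {related}")
```

No pointer is produced for ALLOY because it never occurs as a
keyword, so the user is not sent to an empty collection.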

"See also" cross references alert the user to synonyms
present in the index but do not alter the ordering of actual
index terms. The user is forced to perform this
restructuring by following the synonymal pointers and
examining those related entries.



Figure 6.6 Vocabulary normalization in a PANDEX
index collating preferred words but not altering
the original text

An approach employed by CCM in the construction of the
PANDEX index allows the index terms to be collected under a
single access point. All main keywords are subject to the
normalization of a thesaurus. Collation of the index
entries is performed first on the normalized preferred
word, which is printed as the main term, followed by the
secondary term. The main and subordinate keywords are
printed in boldface within the context without alteration.
Consequently, synonyms appear grouped beneath a preferred
word while the original text of the title phrase is
preserved (see Figure 6.6).
6.3. Are Titles Sufficient?

The advent of KWIC and other computer-generated title
indexes has caused much concern over the adequacy of titles
as the sole source of indexing information. Titles are
being utilized under the general assumption that there is a
positive correlation between the title and content of the
article.

Specific studies of title adequacy for particular
journals or fields have produced varying results. By
comparing the subject entries in Physics Abstracts with
words appearing in the titles of selected articles, Maizell
{Maizell,60} found that 69% of the entries for these papers
were directly derived from title words. Ruhl {Ruhl,64}
found that between 50% and ^0% of author-prepared titles did
fully reflect those index terms assigned by human indexers.
The variations observed reflected the different subject
fields examined; the more specific the subject area, the
better the title.

Janaske {Janaske,62} has identified two distinct types
of factors which contribute to the difficulties of using
titles as sources of indexing information: 1) the language
habits, background, interests, and idiosyncrasies of the
author; 2) the interests, familiarity with the subject,
language habits, imagination, and idiosyncrasies of the
user. The witty, punning, deliberately non-informative or
so called "pathological title" falls into this first
category as well as the use of unfamiliar acronyms. The
critical problem of bringing the user and indexer vocabulary
into coincidence is the subject of the second category.
Here, the searcher is forced to anticipate the terminology
used by a large number of indexers (i.e., authors). Words
similar only in spelling but describing different concepts
or applications are grouped together. The same concepts may
be expressed in quite different phraseology depending on the
author's, rather than the user's, area of specialization.

Kennedy {Kennedy,63} has stressed that author
participation in writing good titles is essential in this
age of derivative indexing. In his suggestions to authors,
he recommends:

1. consideration of the title as a one sentence abstract

2. use specific terms

3. provisions of enough context to clarify the relationships
   between keywords, but no more than necessary

4. balance of brevity and descriptive accuracy

5. when possible, use words instead of notations

6. filing subjects in relation to titles to introduce
   general concepts in word indexes.

Herner {Herner,63} has mapped the effect of author
participation from yet another, ultimately more crucial
direction. He has reported a significant increase in the
average number of keywords per title taken from articles
appearing in the ADI and ASIS proceedings of the last
decade. More recently, Tocatlian {Tocatlian,70} has
suggested that the quality of titles used for articles in
Chemistry has improved immeasurably since the widespread use
of KWIC title indexes in Chemical Titles and other secondary
publications. If these results are universal, the prognosis
for titles as indexing sources is well founded.

Title enrichment offers another possibility for
improving the effectiveness of titles alone. Pre-editing
and augmentation of titles has been a common practice of
many KWIC users. The added cost and required human analysis
necessary to choose title enriching terms defeats the
purpose of pure derivative indexing techniques. However,
authors submitting articles to some journal publishers are
required to supply pertinent "keywords" as well as an
informative title and abstract. Including these enrichment
terms with the title is a small price to pay for more
effective retrieval.



CHAPTER VII. EVOLUTION OF THE KWIC-DKWIC HYBRID SYSTEM FOR
AUTOMATING AMT SELECTION IN THE DKWIC INDEXING
SYSTEMS



The index provides the primary pathway through which
the researcher threading the maze of published
literature retrieves his quarry. The satisfaction of
success or the frustration of failure from his wanderings
reflect the properties of his map, the index.

The previous chapters have described and illustrated
how the double-KWIC coordinate indexing technique
facilitates access to the information provided by titles at
an increased level of specificity over other comparable
automated indexing techniques. DKWIC, like all of these
automatic indexing techniques, includes some operations
which require the intervention of an index analyst. This
chapter focuses attention on these human operations and
discusses methods of minimizing or eliminating the need for
some of them.

7.1. Magnitude of the Human Interface Requirements for the
DKWIC Indexing Operations

An examination of the DKWIC construction techniques
reveals three areas where an index analyst interface is
required. The first is to determine the words which have to
appear on the stoplists (sections 3.2.1, 4.3, 4.5, and 5.2).
The main term stoplist governs the quality of the main index
terms and, to a great extent, the size of the ensuing index.
Potential main terms (PMTs) beginning with words on this
stoplist are not generated, thus precluding them even from
consideration in later main term selection phases. A well
constructed main term stoplist enables the analyst to reject
unimportant access points and improve the overall quality
of the index while reducing its size. The cost of excluding
a word from the main term stoplist should exert only a minor
influence on judging a word's significance. The subordinate
term stoplist, too, influences the quality and size of the
index. In the construction schemes previously described,
the subordinate stoplist is the sole determinant of the
quality of subordinate terms in permuted DKWIC index
entries. In addition to prepositions, conjunctions, and
articles, other words of extraordinarily low information
content (e.g., some, any, etc.) should be placed on this
list. By including as few as twenty-five words on this
list, the number of subordinate terms generated can be
reduced by as much as 40%, with a comparable reduction in
the overall size of the index and considerable improvement
in the quality of both main and subordinate index terms.
Such a small subordinate term stoplist is made possible by a
quantitative context measurement which permits all words
having less than a specified number of characters to be
included as members of the list. For a new subject area,
the production of stoplists can be greatly eased by the
generation of a trial index to determine the vocabulary of
the data base. Once the stoplists have been created for a
particular subject area they can be used repeatedly with
only periodic updating.
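
The combination of a short explicit list with a length
threshold can be sketched as below. The sample words, the
threshold of three characters, and the function name are all
hypothetical; the text specifies neither the list contents nor
the cut-off used.

```python
# Illustrative entries only; the actual list is not reproduced in the text.
SUBORDINATE_STOPLIST = {"and", "the", "for", "with", "some", "any"}


def is_stopped(word, stoplist=SUBORDINATE_STOPLIST, min_length=3):
    """True if the word is too short or appears on the explicit stoplist."""
    w = word.lower()
    # Words shorter than min_length count as list members implicitly,
    # which is what keeps the explicit list small.
    return len(w) < min_length or w in stoplist


title = "Some Effects of pH on the Absorption of Sodium"
print([w for w in title.split() if not is_stopped(w)])
```

A length test of this kind discards "of", "on" and "pH" alike,
so short but meaningful notations are lost unless the
threshold is chosen with the subject vocabulary in mind.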

The second operation requiring the attention of the
index analyst concerns the maintenance of the
singular-plural exception lists (section 6.1.2) for the
vocabulary normalization procedures which have been shown
to be an important tool for improving the quality of the
index. These exception lists, which are required by the
automatic depluralizing algorithm for eliminating
inflectional scattering of main index terms, are less data
dependent than stoplists but will require updating as new
data bases are encountered.

The third and most critical operation requiring human
intervention involves selection of the actual main terms
(AMTs) which are to appear in the final index (section 5.3).
These AMTs have to be selected from the PMT list generated
from the particular collection of titles being indexed. The
selection procedure is clouded by the subjectivity involved
in determining the "worth" of a collection of main terms, a
judgment weighted both by economic considerations (size of
the index) and the requirement to "cover" the titles being
indexed. An index is said to cover a collection of titles
if there exists at least one actual main term (AMT)
beginning with each significant word of each title of the
collection. Similarly, the set of titles covered by a main
term is that subset of the title collection containing that
main term. The remainder of this chapter deals exclusively
with the problems surrounding AMT selection and culminates
with a solution for automating this highly subjective manual
phase of the DKWIC indexing operations.
7.2. Examination of the AMT Selection Processes

As suggested in the preceding section, an index
analyst's primary concern in the AMT selection process is
production of a covering index. However, he may be
influenced by cost considerations to choose less appropriate
actual main terms. In order to clarify this discussion of
the AMT selection process and its ramifications, some
notation must first be introduced.

Let A represent a potential main term and COVER(A)
denote the set of titles covered by A. FIRST(A) symbolizes
the first word of the phrase A.
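
The notation can be made concrete with a small Python sketch.
Titles are represented as plain strings and identified by
their position in a list; these representations, the sample
titles, and the helper names are assumptions made for
illustration.

```python
def first(term):
    """FIRST(A): the first word of the phrase A."""
    return term.split()[0]


def contains_phrase(title_words, phrase_words):
    """True if the phrase occurs as consecutive words of the title."""
    n = len(phrase_words)
    return any(title_words[i:i + n] == phrase_words
               for i in range(len(title_words) - n + 1))


def cover(term, titles):
    """COVER(A): indexes of the titles that contain the main term A."""
    phrase = term.split()
    return {i for i, title in enumerate(titles)
            if contains_phrase(title.split(), phrase)}


def index_covers(amts, titles, stoplist=frozenset()):
    """True if some AMT begins with every significant word of every title."""
    starts = {first(amt) for amt in amts}
    return all(word in starts
               for title in titles
               for word in title.split()
               if word not in stoplist)


titles = ["INFORMATION CONTROL SYSTEMS",
          "INFORMATION RETRIEVAL",
          "CHEMICAL INFORMATION"]
print(cover("INFORMATION", titles))
print(cover("INFORMATION CONTROL", titles))
```

Here COVER("INFORMATION CONTROL") is a subset of
COVER("INFORMATION"), the situation examined in the
selection decision that follows.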

Now, consider the following typical selection decision.
Potential main term A is considered an important choice for
inclusion in the final index since it singles out a
significant, specific phrase common to a collection of
documents. Because the index must cover the titles
submitted, other main terms beginning with FIRST(A) may have
to be chosen. In many cases, selections cannot be made
without adding unnecessary redundancy to the final index.
The potential main term B, with FIRST(A) = FIRST(B), may
have to be chosen to complete the covering, but COVER(A) is
totally subsumed by COVER(B). Consequently, in an effort to
reduce the size of the index, term B is chosen over term A
even though the latter is presumably an important access
point.

The method employed by the analyst in choosing these
entries is facilitated by the printed potential main term
statistics (Figure 5.6), which provide an indication of the
size of each PMT's covering set, and the assumption that the
covering sets for PMTs having the same first word and the
same number of words in the PMT phrase are mutually
exclusive (i.e., a single title does not contain both
"INFORMATION CONTROL" and "INFORMATION SYSTEM"), an
assumption which is not always valid. The sum of the
covering set sizes for two-word main terms can then be
compared with the size of the covering set for the
corresponding single-word main term to estimate the overlap
produced by selection of the two-word main terms. For
high-density PMTs, this process can be extremely difficult
to perform. Even with care, the selections produce
considerable redundancy of entries, and a proportional
increase in the size and cost of the index. Furthermore,
the selection problems are compounded when main term phrases
having more than two words are introduced.

7.3. AMT Selection Algorithms for Minimizing Index Size and
Cost

The size and cost factors influencing the selections
made by the index analyst can be minimized by restructuring
the selection algorithm to allow exclusive set selection
from all AMT covering sets. That is, if "INFORMATION" and
"INFORMATION CONTROL" are both chosen as actual main terms,
then the selection algorithm must insure that all titles
containing the latter multi-word term are excluded from
postings under the single-word term.

This selection burden could be passed to the index
analyst by allowing him the capability to edit subordinate
phrases, through a selection procedure which would be
executed in two steps:

1) From the PMT lists, the analyst would first choose
the desired AMTs, neglecting for the time being any
overlapping covering sets.

2) An AMT list with appropriate subordinate entry
accession codes would be prepared, from which the
analyst could eliminate those overlapping entries which
were to be excluded from the final index.

Additionally, the analyst could perform finer selections at
the subordinate entry level by choosing actual subordinate
entries (ASEs) from each covering set of potential
subordinate entries (PSEs). However, this additional task,
which the analyst would have to perform manually, would make
the selection processes an even greater chore than at
present, particularly for large indexes.

At least the process of generating exclusive covering
sets could be relegated to automatic procedures. Let us
consider the AMT selection from groups of PMTs. First, the
PMTs would be segmented into mutually exclusive groups whose
membership is determined by the first word of the PMT.
Then, potential subordinate entries belonging to the
covering set of each filial AMT in the group would be
subject to set intersection with its parent AMT covering
set. The actual subordinate entries associated with an AMT
would include all PSEs not found in the intersections with
its offspring.

Before this approach is detailed, let us examine more
carefully the structure of a PMT group. Figure 7.1 displays
a typical PMT group which contains several distinct two-word
and three-word potential main terms. Suppose that from this
group the terms "INFORMATION", "INFORMATION PROCESSING",
"INFORMATION PROCESSING CONTROL", "INFORMATION SCIENCE
PROGRAMS", and "INFORMATION RETRIEVAL" were chosen as AMTs.
These AMTs can be arranged in a dependent sequence
represented by a tree structure as shown in Figure 7.2.

Figure 7.1 A potential main term group consisting
of all PMTs which begin with the same word (see
text)



                      INFORMATION
                           |
        +------------------+------------------+
        |                  |                  |
  * PROCESSING    * SCIENCE PROGRAMS     * RETRIEVAL
        |
  * * CONTROL

Figure 7.2 An AMT tree chosen from the PMT group
of Figure 7.1



In order to discuss the relationships among elements of
this tree structure, let us define some useful terminology.
Let T be a directed tree with nodes {t<0>,t<1>,...,t<n>},
root element t<0>, and branches {b<0>,b<1>,...,b<n-1>}. A
directed tree has the property that each node, except the
root node, has one and only one branch directed to it. As a
consequence, the branch-node relationship defines a
successor function, S(t<i>), on the nodes of T such that
t<j> is an element of S(t<i>) if and only if a branch of the
tree is directed from node t<i> to node t<j>. The successor
relationship models the dependency found in AMT trees. The
successor function generates filial sets of nodes, S(t<i>) =
{t<i<1>>,t<i<2>>,...,t<i<m>>}, and nodes having empty
successor sets are called terminal.

In the AMT group cited in Figure 7.2, the root of the
tree is signaled by the single-word main term INFORMATION.
The successors of the root element are signified by
S(INFORMATION) = {* PROCESSING, * SCIENCE PROGRAMS, *
RETRIEVAL}. Except for * PROCESSING, the elements of this
set are terminal. The sole successor of * PROCESSING is
S(* PROCESSING) = {* * CONTROL}.

Each of the AMTs chosen from a PMT group contains
possibly overlapping covering sets of PSEs. An algorithm
for reducing these overlapping PSE sets to mutually
exclusive PSE sets can be described, employing the tree
structure terminology introduced above.

1) Starting with the root element of an AMT group, form
the union of all PSEs associated with each node of the
successor set of the root element. The exclusive PSEs of
the root element are the PSEs remaining after deletion
of the PSE elements in the above union from the total
set of PSEs assigned to the root element. If the
root-element exclusive PSE set is empty, the actual main
term is not selected.

2) Let each element of the filial set, S(t), act as the
root element of an AMT subtree and perform the
operation defined in 1) for each of these elements.

The order in which the exclusive PSEs are selected is
important. From the PSEs of the root element, all PSEs of
the root's successors must be excluded and not just the
exclusive PSEs of the root's successors. The algorithm may
be stated symbolically in the recursive procedure below.



SELECTERM(T)

1. Z<T> = P<T> - P<S(T)>
2. R = next element of S(T); no more, return
3. SELECTERM(R); go to 2

where

P<T> designates the total PSEs assigned to node T
Z<T> designates the exclusive PSEs assigned by the function
     SELECTERM to node T
S(T) designates the set of successors to node T

The function SELECTERM operates on an entire tree and
is activated by an initial call SELECTERM(ROOT) where ROOT
is the root of an AMT group.
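
The recursive procedure can be sketched in Python as follows. This is only an illustration: the Node class, its field names, and the sample title numbers are hypothetical, and each total PSE set is assumed to be held in memory as a set of title numbers.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    term: str
    total: set                                      # P<T>: all PSEs (title numbers) of this term
    successors: list = field(default_factory=list)  # S(T)
    exclusive: set = field(default_factory=set)     # Z<T>, filled in by selecterm

def selecterm(node):
    """Assign an exclusive PSE set to every node of the tree rooted at node."""
    covered = set()
    for child in node.successors:
        covered |= child.total        # union of the successors' TOTAL sets
    node.exclusive = node.total - covered
    for child in node.successors:
        selecterm(child)

# Hypothetical title numbers, arranged like part of the tree of Figure 7.2.
control = Node("* * CONTROL", {3})
processing = Node("* PROCESSING", {2, 3}, [control])
retrieval = Node("* RETRIEVAL", {4, 5})
root = Node("INFORMATION", {1, 2, 3, 4, 5}, [processing, retrieval])
selecterm(root)
print(root.exclusive, processing.exclusive, control.exclusive)
```

Note that the successors' total sets, not their exclusive sets, are removed from the parent, as the text requires.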

The example AMT group described by Figure 7.2 requires
that the PSEs of * PROCESSING, * SCIENCE PROGRAMS, and
* RETRIEVAL be collected before the exclusive PSEs of
INFORMATION can be determined. To perform this implied
order of operations on the PMT file, major modifications of
the earlier operations would be required. Either two
distinct passes over a sequential PMT file would be needed,
or each PMT record would have to be directly accessible.
Another significant point that must be taken into
consideration is the number of set exclusions necessary to
compute the function SELECTERM. Even the most sophisticated
algorithms for performing set intersections (or exclusions)
require extensive searching of possibly lengthy lists.
Should it not be possible to carry out these searches in
primary memory, the cost of direct-access secondary storage
access would probably be prohibitive. These considerations
led to reexamination of the approach used for PMT
generation.

7.4. Influence of the PMT Generation Process on AMT

In essence, algorithms for deleting the overlaps caused
by non-exclusive PSE covering sets associated with elements
of an AMT group (see preceding section) would require
elimination of PSEs which initially had to be created and
manipulated in some earlier stage of processing.
Consequently, if the selection algorithms described in the
last section were to be implemented, the double costs of
generating and deleting subordinate entries must be borne.
Therefore, a reexamination of the methods for PMT generation
(sections 5.1 and 5.2) was warranted.

The manual procedures by which an index analyst chooses
actual main terms of an index appear to be weighted by the
number of titles covered by a particular PMT if it were
chosen. The reasons for basing the choice of AMT selection
on occurrence statistics are well founded. The more often a
phrase (i.e., multi-word term) occurs in a corpus of
documents, the more important this phrase must be. In fact,
this was the reason why automatic generation of multi-word
main terms seemed so attractive a possibility for increasing
indexing depth. The statistical information presented to
the analyst by the PMT listings helps him to tailor the AMT
group on the basis of the occurrence statistics inherent in
the data itself. The statistical data would be more useful
if it referred to non-overlapping covering sets.

7.4.1. A Process for Generating Exclusive PSE
(Potential Subordinate Entry) Sets

A closer examination of the PMT group depicted in
Figure 7.1 reveals a tree in left-list form whose nodes are
PMT entries. The numeric quantities listed beside each PMT
indicate the number of titles in the PMT file that contain
the extracted potential main term. This number is always
greater than or equal to the sum of the occurrences of each
covering set member. From the statistical information
accompanying the PMT group, the size of the exclusive PSE sets
for each node can be easily calculated (though not
manipulated, as stated before). The frequency count of each
terminal node is a reflection of the exclusive PSE set containing
this PMT. The size of the exclusive sets of non-terminal
nodes can be calculated from the function given below.

Let P<t>, t an element of T, be the size of the total
PSE set and Z<t> be the size of the exclusive PSE set
associated with node t. Then

     Z<t> = P<t> - P<S(t)>

where P<S(t)> is the size of the total PSE set of all the
filial nodes of t. Figure 7.3 displays the PMT tree of
Figure 7.1 in another format with values of P and Z for
each node of the tree.

Figure 7.3 The PMT tree for the PMT group of
Figure 7.1 showing values for total PSE sets (P)
and exclusive PSE sets (Z) for all the nodes
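
As a small worked illustration of Z<t> = P<t> - P<S(t)>, the sketch below computes exclusive counts from total counts. The P values are those that appear in the trace of Figure 7.9. The tuple layout and the function name are hypothetical, and the subtrees below * DISSEMINATION and * SCIENCE are omitted because their counts are not given in this section.

```python
def exclusive_counts(tree, out=None):
    """tree is (term, P, children); returns {term: Z}, Z = P minus the children's P."""
    out = {} if out is None else out
    term, p, children = tree
    out[term] = p - sum(child[1] for child in children)
    for child in children:
        exclusive_counts(child, out)
    return out

tree = ("INFORMATION", 25, [
    ("* CONTROL", 5, [("* * BY AUTOMATED", 1, [])]),
    ("* DISSEMINATION", 2, []),   # subtree omitted, so its Z is not meaningful here
    ("* PROCESSING", 5, [("* * CONTROL", 2, []), ("* * UTILITY", 1, [])]),
    ("* SCIENCE", 3, []),         # subtree omitted, so its Z is not meaningful here
    ("* RETRIEVAL", 6, []),
])
z = exclusive_counts(tree)
print(z["INFORMATION"], z["* CONTROL"], z["* PROCESSING"], z["* RETRIEVAL"])
```

The four printed values agree with the exclusive counts selected at steps 1.2 and 2.2 of the trace.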



The exclusivity of potential subordinate entry sets is
intimately linked to the potential main terms which were
extracted from the elements of these sets. The KWOC-DKWIC
generation process creates these sets only for those PMTs
which are terminal nodes of a PMT tree. Each non-terminal
node forms a root node of a PMT subtree and the PSE set
contains the terms of all successor nodes as well as terms
pertaining exclusively to this node. The PMT associated
with each of these exclusive sets can be distinguished
during PMT generation since either the maximum size PMT
(specified by a user input parameter) had been generated
from this position in the title or a terminating break
character was found immediately following the PMT.

Let us assume that the PMT generation process creates
only these types of entries. Can the useful PMT lists used
for the prototype DKWIC code be generated? Figure 7.4
displays the terminal PMT statistics, Z<t>, that would be
generated from the normal PMT list of Figure 7.1.

Figure 7.4 Terminal PMT statistics, Z<t>, for the
PMT group of Figure 7.1. P<t> represents the
normal PMT statistics presented in Figure 7.1.

By rearranging the expression for Z<t>, the normal PMT
statistics, P<t>, can be calculated.

     P<t> = Z<t> + P<S(t)>

The calculation is straightforward, though recursive. The
implications for a selection algorithm, however, are not so
simple. Unless all the PSEs from a given terminal PMT entry
are chosen for the final index, the PSEs not chosen will
have to be modified to conform to a chosen AMT. To perform
the modification of both the main term and subordinate term
entries, some new terminology must be introduced which
describes the generation of terminal PMTs and their PSEs.
This is developed in the next section.
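
The recursion P<t> = Z<t> + P<S(t)> can also be unrolled: every terminal count Z contributes to the total of each of its antecedents. The sketch below does this. The counts for INFORMATION, CONTROL, PROCESSING, and RETRIEVAL come from the trace of Figure 7.9, while the split of counts under DISSEMINATION and SCIENCE is an assumption made only to complete the example, and the function name is hypothetical.

```python
from collections import Counter

def totals_from_exclusive(z_counts):
    """z_counts maps a term (tuple of specificity units) to Z; returns P for every prefix."""
    p = Counter()
    for units, z in z_counts.items():
        for depth in range(1, len(units) + 1):
            p[units[:depth]] += z      # add Z to the term itself and to all antecedents
    return p

z_counts = {
    ("INFORMATION",): 4,
    ("INFORMATION", "CONTROL"): 4,
    ("INFORMATION", "CONTROL", "BY AUTOMATED"): 1,
    ("INFORMATION", "DISSEMINATION"): 1,                 # assumed split
    ("INFORMATION", "DISSEMINATION", "TO SCIENCE"): 1,   # assumed split
    ("INFORMATION", "PROCESSING"): 2,
    ("INFORMATION", "PROCESSING", "CONTROL"): 2,
    ("INFORMATION", "PROCESSING", "UTILITY"): 1,
    ("INFORMATION", "SCIENCE", "PROGRAMS"): 3,           # assumed: no exclusive SCIENCE entries
    ("INFORMATION", "RETRIEVAL"): 6,
}
p = totals_from_exclusive(z_counts)
print(p[("INFORMATION",)], p[("INFORMATION", "PROCESSING")])
```

Because each entry is visited once, the totals can be accumulated in a single pass over a sorted file of terminal entries.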

7.4.2. Maximal Main Terms (MMTs) and Specificity Units

To implement the processes described in the last
section, a restricted set of PMTs to be generated, which are
all terminal, will be called maximal main terms (MMTs).
Maximal main terms are constructed from a title in segments,
called specificity units. The specificity of a main term is
the number of specificity units contained therein. If a
maximal main term requires alteration during the AMT
selection process, it is modified from one of higher
specificity (i.e., having a greater number of specificity
units) to one of lower specificity by the deletion of
specificity units from the MMT moving right to left.

Specificity units are defined formally in two classes:

1) any word not appearing on the primary stoplist;

2) the shortest contiguous string of words delimited on
the left by another specificity unit and ending with a
word that is not a member of the secondary stoplist.

Figure 7.5 illustrates the specificity units found in a
particular title.

Combining the definition of specificity units with the
previous definitions for potential main terms (section 5.2),
a maximal main term has the following characteristics:

a) the first word of a MMT is a type 1 specificity
unit;

Title

THE RETRIEVAL OF INFORMATION BY AUTOMATED SYSTEMS: A SURVEY

Specificity Units

Type 1

RETRIEVAL
INFORMATION
AUTOMATED
SURVEY

Type 2

OF INFORMATION
BY AUTOMATED
SYSTEMS
A SURVEY

Figure 7.5 The specificity units generated from a
title. The word "SYSTEMS" appeared on the primary
stoplist and the words "THE", "OF", "BY", and "A"
appeared on the subordinate stoplist



b) contiguous specificity units of type 2 are contained
in the MMT as long as a maximum specificity (supplied
through a user input parameter) has not been surpassed,
or terminating punctuation has not been found while
attempting to construct the next specificity unit.

The maximal main terms that can be constructed from the
title illustrated in Figure 7.5 are displayed in Figure 7.6.
Typically the number of MMTs found in a title is equal to
the number of significant words found therein. The
specificity of each of these MMTs is dependent upon one of three
factors: the input parameter indicating the maximum
specificity; the terminating punctuation; or the length of the
title (if no terminating punctuation is used).

RETRIEVAL OF INFORMATION BY AUTOMATED     3
INFORMATION BY AUTOMATED SYSTEMS          3
AUTOMATED SYSTEMS                         2
SURVEY                                    1

Figure 7.6 The maximal main terms formed from the
specificity units illustrated in Figure 7.5

The total number of PMTs that could be generated from a
given title is the sum of the specificities of MMTs generated
from the same title. In the example above, nine PMTs would
have been generated whereas only four MMTs are generated.
Assuming that a computer record of the type illustrated in
Figures 5.3 and 5.5 is constructed for each PMT or MMT, then,
in this example alone, less than half of the records
generated for index production with PMTs would have to be
generated with MMTs.
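
The construction of maximal main terms from specificity units can be sketched as below, using the title of Figure 7.5. This is an illustration under stated assumptions, not the original program. The function names and the set of terminating punctuation are hypothetical. The primary stoplist is assumed to contain the secondary stoplist words as well as "SYSTEMS", since otherwise "THE", "OF", "BY", and "A" would qualify as type 1 units.

```python
TERMINAL = ":;.!?"   # assumed set of terminating punctuation

def tokenize(title):
    """Return (word, ends_phrase) pairs; ends_phrase marks terminating punctuation."""
    tokens = []
    for raw in title.upper().split():
        word = raw.strip(TERMINAL + ",")
        if word:
            tokens.append((word, raw[-1] in TERMINAL))
    return tokens

def maximal_main_terms(title, primary, secondary, max_spec):
    tokens = tokenize(title)
    mmts = []
    for start, (word, _) in enumerate(tokens):
        if word in primary:
            continue                     # only a type 1 unit may begin an MMT
        end, spec = start, 1
        while spec < max_spec and not tokens[end][1]:
            # Build the next type 2 unit: skip secondary stopwords up to a content word.
            nxt = end + 1
            while (nxt < len(tokens) and tokens[nxt][0] in secondary
                   and not tokens[nxt][1]):
                nxt += 1
            if nxt >= len(tokens) or tokens[nxt][0] in secondary:
                break                    # title or phrase ended before the unit was complete
            end, spec = nxt, spec + 1
        mmts.append((" ".join(w for w, _ in tokens[start:end + 1]), spec))
    return mmts

SECONDARY = {"THE", "OF", "BY", "A"}
PRIMARY = SECONDARY | {"SYSTEMS"}        # assumption, see the note above
TITLE = "THE RETRIEVAL OF INFORMATION BY AUTOMATED SYSTEMS: A SURVEY"
mmts = maximal_main_terms(TITLE, PRIMARY, SECONDARY, max_spec=3)
for term, spec in mmts:
    print(term, spec)
print(sum(spec for _, spec in mmts))     # number of PMTs the same title would yield
```

With a maximum specificity of 3 the sketch yields the four MMTs of Figure 7.6, and the sum of their specificities is the nine PMTs mentioned in the text.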

7.5 An AMT Selection Algorithm

Each MMT generated as above produces exactly one AMT in
the final index such that a covering index must result. The
selection procedure thus reduces to choosing the proper
specificity for all AMTs from the MMTs generated. Again, we
refer to the organization of the MMT groups to describe a
method of manipulating these terms, and according to the
definitions in section 7.4.2, the terminal PMT group of Figure
7.4 can now be looked upon as such an MMT group.

The MMTs can be segmented into groups in a fashion
similar to the PMTs, membership being determined by the
initial specificity unit. The MMT group is again organized
as a tree in left-list form, though many intermediate nodes
of the corresponding PMT tree may be absent since all
elements are terminal. Note, for example, the absence of
"INFORMATION SCIENCE" as an MMT in Figure 7.4 which was
present as a PMT in Figure 7.1. However, all the
information is present in the MMT tree to construct the PMT
tree of Figure 7.3.

The actual specificity of each of the AMTs selected or
generated from a MMT group must be determined. Since it
would be quite a chore to input that information for each
entry, the following set of default AMT specificities has
been designated which may be overridden by an index analyst.

1) The specificity of the first AMT of the group

2) The specificity of the next AMT of the group is
the minimum of the specificity chosen for the
present entry and the MMT specificity of the next
MMT.

Because of the second rule, few override commands need
to be applied per MMT group. In order to create the AMTs
specified in Figure 7.2 from the MMT group of Figure 7.4,
the override commands displayed in Figure 7.7 would be
necessary. Note how the remainder of the specificity
tailoring would be handled by the default specificity rules.

Override     MMT
Commands     Seq     MMT

INFORMATION
INFORMATION CONTROL
INFORMATION CONTROL BY AUTOMATED
INFORMATION DISSEMINATION
INFORMATION DISSEMINATION TO SCIENCE
INFORMATION PROCESSING
INFORMATION PROCESSING CONTROL
INFORMATION PROCESSING UTILITY
INFORMATION SCIENCE PROGRAMS
INFORMATION RETRIEVAL

Figure 7.7 The selection override commands
necessary to form the AMT selections illustrated
in Figure 7.2 from the MMT group in Figure 7.4.
The commands are ordered pairs of numbers
signifying the sequence number of the MMT to alter
and the desired AMT specificity. The underlined
terms depict the AMTs selected.



7.6 Automating the AMT Selection Process

If the index analyst determines the actual main terms
strictly by the frequency of occurrence of distinct concepts
found in MMT groups, then the selection process itself
becomes a candidate for automation (Belzer, 71; Carroll, 69).
Reasoning that an AMT of higher specificity is chosen over a
less specific one because the less specific entry would
cover too many titles, a selection algorithm can be
determined.

Let us assume that an upper limit is imposed on the
number of titles to be covered by an AMT. If this limit is
exceeded, then AMTs will be sought at the next higher level
of specificity. At this higher level, AMTs will be chosen
only if the number of titles covered by these terms meets
some minimum criterion. Of course, any MMTs of lower
specificity bypassed while selecting a more specific AMT
will also be chosen as an AMT at the current specificity to
maintain covering. The basic idea is to select AMTs
covering approximately an equivalent number of titles while
selecting, when possible, the most specific AMTs from the
MMT group covering the titles. The algorithm, described
more formally in Figure 7.8, examines the PMT tree generated
from an MMT group and attempts to prune nodes so that the
titles covered exclusively by each node fall between the
values MIN and MAX.



SELECT(T)

1. P<T> > MAX?  (no: go to 5)
2. Select Z<T> PSE whose AMT is the specificity of T
3. R = next element of S(T); no more, return
4. SELECT(R); go to 3
5. P<T> < MIN?  (no: go to 7)
6. Generate P<T> PSE whose AMT is one less than
   the specificity of T; return
7. Select P<T> PSE whose AMT is the
   specificity of T; return

where P<T> and Z<T> are, respectively, the number
of total PSE and exclusive PSE of the node T, and
S(T) is the set of successor nodes of T.

Figure 7.8 The logical flow for an automated main
term selection process



The algorithm is called initially with the root element
of a PMT tree and prunes all subtrees found therein.
Assuming that MAX is 4 and MIN is 2, the results of applying
the algorithm to the PMT tree of Figure 7.3 are displayed in
Figure 7.9. The actual main terms automatically selected
from the PMT tree of Figure 7.3 are summarized in Figure
7.10.



1.1 P(INFORMATION) = 25
1.2 select 4 PSE whose AMT is INFORMATION
1.3 * CONTROL next element of S(INFORMATION)
2.1 P(* CONTROL) = 5
2.2 select 4 PSE whose AMT is INFORMATION CONTROL
2.3 * * BY AUTOMATED next element of S(* CONTROL)
3.1 P(* * BY AUTOMATED) = 1
3.6 generate 1 PSE whose AMT is INFORMATION CONTROL
2.3 no more elements in S(* CONTROL)
1.3 * DISSEMINATION next element of S(INFORMATION)
2.1 P(* DISSEMINATION) = 2
2.7 select 2 PSE whose AMT is INFORMATION DISSEMINATION
1.3 * PROCESSING next element of S(INFORMATION)
2.1 P(* PROCESSING) = 5
2.2 select 2 PSE whose AMT is INFORMATION PROCESSING
2.3 * * CONTROL next element of S(* PROCESSING)
3.1 P(* * CONTROL) = 2
3.7 select 2 PSE whose AMT is INFORMATION PROCESSING
    CONTROL
2.3 * * UTILITY next element of S(* PROCESSING)
3.1 P(* * UTILITY) = 1
3.6 generate 1 PSE whose AMT is INFORMATION PROCESSING
2.3 no more elements of S(* PROCESSING)
1.3 * SCIENCE next element of S(INFORMATION)
2.1 P(* SCIENCE) = 3
2.7 select 3 PSE whose AMT is INFORMATION SCIENCE
1.3 * RETRIEVAL next element of S(INFORMATION)
2.1 P(* RETRIEVAL) = 6
2.2 select 6 PSE whose AMT is INFORMATION RETRIEVAL
2.3 no more elements of S(* RETRIEVAL)

Figure 7.9 A trace of automated main term
selections for the PMT tree of Figure 7.3. The
numeric pairs refer to recursion level and
algorithm line number respectively.



                                  Titles covered by
AMT                               exclusive PSEs

INFORMATION                              4
INFORMATION CONTROL                      5
INFORMATION DISSEMINATION                2
INFORMATION PROCESSING                   3
INFORMATION PROCESSING CONTROL           2
INFORMATION SCIENCE                      3
INFORMATION RETRIEVAL                    6

                                        25

Figure 7.10 A summary of automatic main term
selections performed on the PMT tree of Figure 7.3
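
The selection logic of Figure 7.8 can be sketched in Python and checked against the summary of Figure 7.10. The tree layout (name, Z, successors) and the function names are hypothetical. The exclusive counts under DISSEMINATION and SCIENCE are assumptions chosen to match the totals P = 2 and P = 3 shown in the trace, while the other counts are taken from the trace itself.

```python
from collections import Counter

def total(node):
    """P<T>: exclusive count of the node plus the totals of its successors."""
    _, z, children = node
    return z + sum(total(child) for child in children)

def select(node, lo, hi, path=(), out=None):
    """Follow Figure 7.8; returns a Counter mapping each AMT to the titles it covers."""
    out = Counter() if out is None else out
    name, z, children = node
    term = path + (name,)
    p = total(node)
    if p > hi:                       # steps 1-4: too many titles, look deeper
        if z:
            out[" ".join(term)] += z
        for child in children:
            select(child, lo, hi, term, out)
    elif p < lo:                     # steps 5-6: too few, fold into the parent term
        out[" ".join(path)] += p     # (a root below MIN would get an empty term here)
    else:                            # step 7: keep at this specificity
        out[" ".join(term)] += p
    return out

tree = ("INFORMATION", 4, [
    ("CONTROL", 4, [("BY AUTOMATED", 1, [])]),
    ("DISSEMINATION", 1, [("TO SCIENCE", 1, [])]),   # assumed split of P = 2
    ("PROCESSING", 2, [("CONTROL", 2, []), ("UTILITY", 1, [])]),
    ("SCIENCE", 0, [("PROGRAMS", 3, [])]),           # assumed split of P = 3
    ("RETRIEVAL", 6, []),
])
result = select(tree, lo=2, hi=4)
for amt, count in result.items():
    print(amt, count)
```

Every title is assigned to exactly one AMT, so the counts sum to P(INFORMATION) = 25 and the index remains covering.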



7.7. Automatic AMT Selection Failures and their Remedies:
The KWIC-DKWIC Hybrid Index

The AMT selection algorithm discussed previously bases
the selection procedure on two criteria usually used by
index analysts. The first is the specificity of the
potential main term since the more specific a main term, the
more information conveyed to the user. The second is the
number of occurrences of the PMT to determine the importance
of a phrase in the context of the data base being indexed.
The analyst usually chooses a more specific main term, where
possible, provided there are a sufficient number of
occurrences in items found in the data base.

There are situations, however, where the most specific
PMT is the most appropriate even if it occurs only once in
the data base (e.g., "AMERICAN CHEMICAL SOCIETY", rather
than "AMERICAN" or "AMERICAN CHEMICAL", an example taken
from the prototype DKWIC index). Consequently, a selection
algorithm which determines specificity of main terms solely
on the basis of occurrence of phrases in the data base will
fail when the technique is applied to low occurring phrases.
An instance of this is shown in Figure 7.10 where the
selection algorithm chose "INFORMATION SCIENCE" over the
more specific term "INFORMATION SCIENCE PROGRAMS" which
probably would have been chosen by an index analyst.

The occurrence frequencies of these low occurring
phrases usually fall below the threshold for creating DKWIC
permuted subordinate entries. Consequently, they have been
formatted as KWOC-type entries in the KWOC-DKWIC hybrid
index. The failure of the selection algorithm, then,
results from its inability to select or create the most
appropriate main term in a KWOC-type entry for these low
occurring specific concepts. However, as discussed in
section 3.2.1, the KWOC-type format has few advantages over
the KWIC format. Extraction of main terms makes the KWOC
format resemble traditional indexing formats, but the user
still has to scan the context of the title to recognize
fully the meaning and usage of the actual main term. The
KWIC format, on the other hand, does not require the reader
to search for the context about the key phrase since the
remaining part of the title is immediately presented. The
KWOC-DKWIC hybrid index (section 5.1) evolved as such simply
because KWOC-type entries seemed to be consistent with
DKWIC-type entries. However, if the index column (or key
window) of KWIC index entries were left justified, they
would be as compatible with DKWIC entries as
are the KWOC-type entries. The KWIC-type entry would
resolve the selection problem mentioned above in that the
word in the left-justified index column would be followed by
all of the remaining words in the title, thus making the main
index term for a low occurring concept as specific as
needed.



COMPUTER(S) GRAPHIC(S)
   * ..................................................  085
   ANALYSIS PROGRAM: EDUCATIONAL APPLICATIONS IN ELEC+   073
   APPLICATIONS IN ELECTRICAL ENGINEERING *DUCATIONAL    073
   * CIRCUIT ANALYSIS PROGRAM: EDUCATIONAL APPLICATIONS* 073
   COURSE IN * ........................... AN ELECTIVE   068-1
   EDUCATIONAL APPLICATIONS IN ELECTRICAL ENGINEERING*   073
   ELECTIVE COURSE IN * ........................... AN   068-3
   ELECTRICAL ENGINEERING *DUCATIONAL APPLICATIONS IN    073
   ENGINEERING *DUCATIONAL APPLICATIONS IN ELECTRICAL    073
   FACULTY VIEW: * ....................................  264-4
   FRESHMAN AND * ................................ THE   068-2
   * INFORMATION PROCESSES AT THE UNDERGRADUATE LEVEL .. 068
   LEVEL .INFORMATION PROCESSES AT THE UNDERGRADUATE     068
   * ..................................................  068-1
   PROCESSES AT THE UNDERGRADUATE LEVEL .INFORMATION     068
   PROGRAM: .EDUCATIONAL APPLICATIONS IN ELECTRICAL EN*  073
   STUDENTS' VIEW: * ..................................  264-2
   UNDERGRADUATE LEVEL .INFORMATION PROCESSES AT THE*    068
   VIEW: * ................................... FACULTY   264-4
   VIEW: * ................................. STUDENTS'   264-2
COMPUTING FACILITIES AT THE UNIVERSITY OF ALBERTA *HE    257-3
CONSTRUCTION ENGINEERING FOR HIGH SCHOOLS *.C STUDY OF   080
CONSULTATION SERVICES--ACCREDITATION                     153-4



Figure 7.11 Display format for the KWIC-DKWIC
hybrid index



Other advantages of the KWIC-DKWIC hybrid format are as
follows:

1) The overall size of the index is reduced since the
KWIC entries require no main term heading.

2) The size of the index is further reduced since each
KWIC entry requires a single print line while the KWOC-type
entries utilized in the KWOC-DKWIC hybrid index
occupy as many lines as necessary to contain the entire
title.

3) An accurate account of the number of lines necessary
to print the index can be accumulated during the index
generation process.

Figure 7.11 depicts a portion of an index in KWIC-DKWIC
format.

7.8. Implementation of Automated AMT Selection in KWIC-
DKWIC Hybrid Indexes

The method of constructing DKWIC index entries from
maximal main terms differs considerably from either of the
other DKWIC implementations previously described. With
maximal main terms, no potential subordinate entries can be
constructed until the specificity of the actual main term is
determined. As a consequence, the generation process
requires five distinct steps (Figure 7.12) which are
developed in the five subsections that follow.

7.8.1. Generation of Maximal Main Terms

From the input data base, the main-term and
subordinate-term stoplists, and the authority list, two
files are generated. The first file is a title pointer file

[Flowchart. Inputs: TITLE PHRASES, AUTHORITY LIST, MAIN TERM
STOPLIST, SUBORDINATE STOPLIST. Processing steps, in order:
EXTRACT MAXIMAL MAIN TERMS; SORT MMT; CREATE PMT TREES,
SELECT AMT; ALTER MMTS TO AMTS; EXTRACT AMT FROM TITLES,
CREATE SUBORDINATE ENTRIES, PERMUTE WHEN NECESSARY; SORT,
PRINT ENTRIES. Files: TITLE POINTERS, MMT RECORDS, AMT
MARKERS, AMT RECORDS, INDEX ENTRIES.]

Figure 7.12 The system design for creating KWIC-
DKWIC hybrid indexes with automatic AMT selection



where a fixed length record is constructed for each input
title record. Each record in the title pointer file
consists of five arrays which specify the location, length,
main-term stoplist disposition, subordinate-term stoplist
disposition, and the class of terminating punctuation for
each word in the title. This information is recorded at
this time for later use in constructing actual subordinate
entries from the corresponding title.

 1. READ & SORT STOPLIST
 2. READ AUTHORITY LIST
 3. INITIALIZE WORD FINDER & TITLE SEQ NUM
 4. READ NEXT TITLE, BUMP SEQ NUM - NO MORE?
 5. CREATE TITLE LOCATOR DATA
 6. APPLY AUTHORITY LIST TO ENTIRE TITLE,
    DELETE PUNCTUATION
 7. LOCATE NEXT WORD IN TITLE - NO MORE?
 8. IS WORD ON PRIMARY STOPLIST?
 9. INITIALIZE SPECIFICITY TO 1
10. ASSOCIATE LENGTH OF CURRENT MMT WITH
    SPECIFICITY & STORE IN MMT RECORD
11. IS SPECIFICITY AT MAXIMUM?
12. IS PUNCTUATION TERMINAL AFTER LAST WORD?
13. ADD NEXT WORD TO MMT, INSERT BLANK
14. IS LAST WORD ON SECONDARY STOPLIST?
15. INCREASE SPECIFICITY
16. SHORTEN MMT TO LAST RECORDED LENGTH
    AT CURRENT SPECIFICITY, IF NECESSARY
17. WRITE OUT MMT RECORD
18. WRITE OUT TITLE LOCATOR DATA
19. SORT MMT FILE BY MMT

Figure 7.13 Flowchart describing maximal main term
generation
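
The five parallel arrays of a title pointer record can be pictured with the following sketch. The field names, the numeric coding of the punctuation class, and the function name are hypothetical, and the sketch does not pad the arrays to a fixed record length as the original file format would.

```python
import re
from dataclasses import dataclass

@dataclass
class TitlePointer:
    location: list       # starting character offset of each word
    length: list         # number of characters in each word
    on_main_stop: list   # True if the word is on the main-term stoplist
    on_sub_stop: list    # True if the word is on the subordinate-term stoplist
    punct: list          # assumed coding: 0 none, 1 non-terminal, 2 terminal

def title_pointer(title, main_stop, sub_stop):
    rec = TitlePointer([], [], [], [], [])
    for m in re.finditer(r"[A-Za-z0-9'-]+", title):
        word = m.group().upper()
        following = title[m.end():m.end() + 1]   # character right after the word
        rec.location.append(m.start())
        rec.length.append(len(word))
        rec.on_main_stop.append(word in main_stop)
        rec.on_sub_stop.append(word in sub_stop)
        rec.punct.append(2 if following in (":", ";", ".") else
                         1 if following == "," else 0)
    return rec

SUB_STOP = {"THE", "OF", "BY", "A"}
MAIN_STOP = SUB_STOP | {"SYSTEMS"}   # assumed contents, as in Figure 7.5
rec = title_pointer("THE RETRIEVAL OF INFORMATION BY AUTOMATED SYSTEMS: A SURVEY",
                    MAIN_STOP, SUB_STOP)
print(rec.location)
print(rec.punct)
```

Keeping offsets and lengths rather than the words themselves lets later phases cut subordinate entries directly out of the stored title.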

The second file, the MMT file, consists of all maximal
main terms which could be constructed from the input title
data base. Recorded with each MMT is:

a) the sequence number of the title from which it was
extracted;

b) the number of specificity units found in the MMT;

c) the number of characters in any AMT generated from
this MMT if a specificity less than or equal to the
constructed specificity is desired.

A simplified flowchart for generation of these files is
given in Figure 7.13.
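
Item c) is what later allows an MMT to be shortened to any lower specificity without re-parsing the title. A minimal sketch follows, assuming one recorded length per specificity level; the record layout and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MMTRecord:
    title_seq: int     # a) sequence number of the source title
    specificity: int   # b) number of specificity units in the MMT
    lengths: list      # c) character length of the term at specificity 1, 2, ...
    text: str

def make_record(title_seq, units):
    lengths, text = [], ""
    for unit in units:
        text = unit if not text else text + " " + unit
        lengths.append(len(text))        # length once this unit has been appended
    return MMTRecord(title_seq, len(units), lengths, text)

def alter_to_amt(record, specificity):
    """Shorten the MMT to the requested specificity by dropping units from the right."""
    return record.text[:record.lengths[specificity - 1]]

rec = make_record(1, ["RETRIEVAL", "OF INFORMATION", "BY AUTOMATED"])
print(rec.lengths)
print(alter_to_amt(rec, 2))
```

The truncation is a single slice, which matches the right-to-left deletion of specificity units described in section 7.4.2.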

7.8.2. Selection of Actual Main Terms

The sorted MMT file acts as the prime input source for
this phase of the index generation. The automated selection
process consists of three distinct segments, each of which is
invoked for a MMT group found in the input file. The first
task is to segment the MMT file into groups and, in the
process, construct the PMT tree and accumulate the
statistics concerning P<T> and Z<T> (see section 7.4.1) for
each node of the PMT tree.

In order to conserve space, the PMT tree representation
contains two entry types. The first type is a normal node
entry which contains three parts: P<T>, the number of
potential main terms that could be generated for this node;
Z<T>, the number of terminal PMTs for this node; and a
filial link to indicate the next entry in the successor set
that contains this node. The second type is for terminal
nodes representing PMTs of maximum specificity, where P<T>
is equal to Z<T>. For these nodes only one entry in the
tree structure is necessary since any brother elements will
be stored consecutively in the linearized tree format. A
linearized PMT tree for the MMT group shown in Figure 7.4 is
illustrated in Figure 7.14. A flowchart describing
construction of the PMT tree and the accumulation of the
P<T> and Z<T> statistics is depicted in Figure 7.15.

Once the PMT tree for an MMT group has been built, the
AMT selection procedure outlined previously (sections 7.5
and 7.6) chooses the actual specificity of each AMT (see
Figure 7.16). Since the records from the MMT file necessary
to construct the PMT tree have already been processed, the
selection procedure indicates the manner in which the MMTs
found in the tree should be altered by creating marker
records. Each marker record consists of four items: the
initial sequence number of a contiguous set of MMTs to which
the selected AMT specificity applies; the ending sequence
number for this set; the specificity of the AMT selected;
and a fourth field, always zero, which is required for
proper collation. A second type of marker is generated
which conveys the number of exclusive PSEs which a specific
AMT will head. This marker is distinguished from AMT
markers by a zero ending sequence number. The beginning
sequence number is the MMT sequence number of the first AMT
of this set. The fourth field of this record contains the
exclusive PSE count. This information is placed in the
marker file to determine whether the subordinate entries of
this main term should be permuted. The two marker formats
are displayed in Figure 7.17.

Figure 7.14 An illustration of the linearized PMT
tree format for the MMT group illustrated in
Figure 7.4. Only the quantities labeled "tree
element" are stored.

 1. RECORD THIS SPECIFICITY AS SPECIFICITY
    OF GROUP LEADER
 2. SET NO MATCHED SPECIFICITY UNITS;
    SET LAST SPECIFICITY TO ZERO
 3. INITIALIZE TREE SEQ NUM
 4. IS MATCHING SPECIFICITY LESS THAN LAST
    SPECIFICITY?
 5. CREATE TREE ELEMENTS UP TO SPECIFICITY
    OF LAST MMT - SAVE SEQ NUM OF PMT
    ELEMENTS OF SPECIFICITIES LESS THAN
    MAXIMUM
 6. RECORD THIS SPECIFICITY AS LAST
    SPECIFICITY
 7. READ MMT FILE - COUNT MATCHING MMT;
    THIS IS Z<T> FOR LAST SPECIFICITY
 8. COUNT NUMBER OF SPECIFICITY UNITS
    MATCHED IN FIRST NON-MATCHING MMT;
    RECORD SPECIFICITY OF MMT AS THIS
    SPECIFICITY
 9. UPDATE P<T> FOR ALL TREE ELEMENTS FROM
    WHICH MMT WAS CONSTRUCTED BY ADDING
    Z<T> TO THE ACCUMULATED P<T> OF ALL
    ANTECEDENTS
10. IS THIS SPECIFICITY LESS THAN MATCHED
    SPECIFICITY?
11. CREATE FILIAL LINKS FOR TREE ELEMENTS
    BELONGING TO THIS NODE BY RECORDING
    TREE SEQ NUM IN LINK POSITION OF
    ANTECEDENT TREE ELEMENTS OF GREATER
    OR EQUAL SPECIFICITY
12. WERE NO SPECIFICITY ELEMENTS MATCHED?

Figure 7.15 Flowchart describing the construction
of a PMT tree from a MMT group

 1. RECORD END OF TREE (I.E. RANGE OF PMT
    HAVING SPECIFICITY 1)
 2. INITIALIZE SPECIFICITY, TREE SEQ NUM
 3. HAVE ALL FILIAL ELEMENTS FROM THIS
    SPECIFICITY SET BEEN EXAMINED?
 4. SET UP BRANCHES IN TREE TO ELEMENT OF
    HIGHER SPECIFICITY AND NEXT ELEMENT OF
    SAME FILIAL SET
 5. IS THE NUMBER OF TITLES COVERED BY
    THIS ENTRY
    A) LESS THAN MINIMUM?
    B) LESS THAN MAXIMUM OR OF MAXIMUM
       SPECIFICITY?
    C) GREATER THAN OR EQUAL TO MAXIMUM?
 6. THE NUMBER OF MMT ENTRIES AT THIS
    SPECIFICITY ARE CHOSEN AS AMT
 7. INCREASE SPECIFICITY, POINT TO FIRST
    ELEMENT OF THIS HIGHER SPECIFICITY SET
 8. THE NUMBER OF PMT ENTRIES AT THIS
    SPECIFICITY ARE CHOSEN AS AMT
 9. POINT TO NEXT ELEMENT OF FILIAL SET
10. THE NUMBER OF PMT ENTRIES AT
    SPECIFICITY ONE LESS THAN SPECIFIED
    ARE CHOSEN AS AMT
11. POINT TO NEXT ELEMENT OF FILIAL SET
12. DECREASE SPECIFICITY AND RECORD AMT
    COUNT AT THIS SPECIFICITY
13. IS SPECIFICITY EQUAL TO 1?

Figure 7.16 Flowchart describing the AMT selection
process
Figure 7.17 The formats of the actual main term
and the exclusive PSE markers produced by the AMT
selection algorithm

The final step in selecting actual main terms from an
MMT group involves sorting the term markers for the group.
All markers are stored temporarily in main memory until all
selections have been made from a group. Since the exclusive
PSE markers need to be placed before all references to the
MMTs they concern, the sort is performed on the first two
fields of the marker records. When the sort is complete,
the markers are written onto a file and the selection
process continues with the next MMT group. Figure 7.18
displays a sorted set of markers and the implied selections
performed on the MMT group of Figure 7.4 for a maximum
posting limit of 4 and a minimum posting limit of 2.
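
The marker ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original implementation: the record name `Marker`, its field names, and the sample values are hypothetical, and the only property taken from the text is that a count marker carries a zero ending sequence number and that the sort uses the first two fields.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    # Hypothetical record layout; only the first two fields drive the sort.
    begin: int  # MMT sequence number of the first MMT covered
    end: int    # last MMT sequence number, or 0 for an exclusive PSE count marker
    spec: int   # specificity of the actual main term
    count: int  # exclusive PSE count (meaningful for count markers only)


def sort_markers(markers):
    """Sort on (begin, end) so that a count marker (end == 0) precedes
    every AMT marker that begins at the same MMT sequence number."""
    return sorted(markers, key=lambda m: (m.begin, m.end))


# Made-up markers for one MMT group, deliberately out of order.
group = [
    Marker(begin=3, end=4, spec=2, count=0),
    Marker(begin=1, end=2, spec=1, count=0),
    Marker(begin=1, end=0, spec=1, count=5),
    Marker(begin=3, end=0, spec=2, count=2),
]
ordered = sort_markers(group)
```

Because zero is smaller than any real ending sequence number, no special case is needed to bring the count markers to the front of their groups.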

Selection Markers
begin end spec cnt     MMT



                       INFORMATION CONTROL
                       INFORMATION CONTROL BY AUTOMATED

                       INFORMATION DISSEMINATION
                       INFORMATION DISSEMINATION TO SCIENCE

                       INFORMATION PROCESSING

                       INFORMATION PROCESSING CONTROL
                       INFORMATION PROCESSING UTILITY

                       INFORMATION PROCESSING PROGRAMS

                       INFORMATION RETRIEVAL

Figure 7.18 An illustration of the AMT and
exclusive PSE count markers automatically produced
by the AMT selection algorithm from the MMT group
of Figure 7.4. A maximum posting limit of 4 and a
minimum posting limit of 2 were used.



7.8.3. Generation of AMTs From Marker File

The maximal main term file and the actual main term
marker file are processed in parallel during this phase of
the index generation (Figure 7.19). Two distinct operations
are performed: the MMTs are altered to the specificity
indicated by the markers produced in the last phase; and
each newly generated actual main term is coded by a field
which designates the type of ASE that should be formed for
this main term.

The marker file forms a non-overlapping sequence of
instructions to modify each record of the MMT file. Because
of the sorting technique applied during the selection phase,
an exclusive PSE marker precedes the first reference to each
new actual main term entry that is to be constructed (see
Figure 7.18). Because of the organization of the MMT file,
all maximal main terms that are to be modified to the
specificity indicated by the exclusive PSE marker will be so




 1. GET FIRST MMT RECORD & INITIALIZE SEQNO

 2. GET NEXT MARKER RECORD

 3. IS IT AN AMT MARKER?

 4. READ MMT FILE UNTIL MMT SEQNO MATCHES

 5. SET PERMUTATION FLAG IN MMT RECORD
    IF RECORD COUNT FOR THIS SPECIFICITY
    EXCEEDS PERMUTING THRESHOLD

 6. ALTER MMT TO STATED SPECIFICITY & SAVE

 7. HAS END OF MMT SEQUENCE BEEN REACHED

 8. COPY AMT & PERMUTATION FLAG TO MMT
    RECORD AND WRITE OUT

 9. GET NEXT MMT RECORD - NO MORE?

10. SORT AMT FILE ON TITLE SEQNO FOLLOWED
    BY AMT

11. MUST BE COUNT MARKER, RECORD COUNT
    AT INDICATED SPECIFICITY




Figure 7.19 Flowchart describing the tailoring of
MMT records to form actual main terms



altered before another AMT group of this specificity is
encountered. An arbitrary number of AMT groups of higher
specificity may appear before the termination of this AMT
group. Consequently, the exclusive PSE counts are stored
only by specificity.

The modified MMT records are recorded on a separate
file so that the selection process may be performed again,
if necessary, without requiring a re-execution of the
maximal main term generation phase.

In preparation for the next step, the AMT file is
sorted on the combined field of title sequence number
followed by the actual main term.

7.8.4. Actual Subordinate Entry (ASE) Construction

No subordinate entries have to this point been
generated, yet much information concerning them is known.
The number of distinct subordinate entries is equal to the
number of records in the actual main term file. A count
could have easily determined how many of these terms were to
form permuted subordinate entries. (In fact, by the end of
the selection phase enough information can be gathered to
determine an accurate estimate of the size of the index for
various permutation thresholds.)

All actual main terms to be extracted from a given
title are collected in an alphabetical subsequence on the
AMT file prepared during the MMT tailoring phase. This
arrangement allows a sequential processing of both the AMT
file and the original data source. This format also permits
multiple occurrences of an actual main term to be
simultaneously extracted from the title while still
processing the AMT file sequentially (see Figure 7.20).



 1. GET FIRST AMT RECORD

 2. GET NEXT TITLE & CORRESPONDING TITLE
    LOCATOR RECORD - NO MORE?

 3. DOES TITLE CORRESPOND TO CURRENT AMT

 4. UNPACK ACCESSION CODE FROM TITLE RECORD

 5. RECORD WORD POSITION AND NUMBER OF
    WORDS TO EXTRACT

 6. IS PSE A DKWIC ENTRY

 7. EXTRACT WORDS FROM TITLE, INSERT AMT,
    ROTATE TITLE SO THAT AMT BEGINS PHRASE,
    ADD ACCESSION CODE, TYPE, & WRITE OUT

 8. READ NEXT AMT RECORD - NO MORE?

 9. DOES AMT APPLY TO THIS TITLE?

10. DOES AMT MATCH RECORDED AMT?

11. FOUND DUPLICATE, RECORD POSITION

12. SORT POSITIONS IN ASCENDING ORDER

13. WORKING FROM THE RIGHT END OF THE
    TITLE, EXTRACT EACH OCCURRENCE OF THE
    AMT INSERTING AN * AND UPDATING TITLE
    LOCATOR DATA

14. ATTACH ACCESSION CODE AND TYPE,
    WRITE OUT

15. LOCATE NEXT WORD OF TITLE - NO MORE?

16. WORD ON SECONDARY STOPLIST?

17. ROTATE TITLE SO THAT WORD BEGINS
    PHRASE, ADD ACCESSION CODE, TYPE, ADD
    AMT TO RECORD AND WRITE OUT

18. GET NEXT AMT RECORD - NO MORE?

19. SORT PRINT FILE BY AMT FOLLOWED BY ASE



Figure 7.20 Flowchart describing the generation of
ASEs



Depending upon the code set during the previous phase
in each AMT record, the resulting subordinate entry is
either permuted or recorded as a single entry. Subordinate
index terms in permuted subordinate entries are controlled
by the secondary stoplist indicator created for the
corresponding title during the first phase of production.

The images recorded on the final index file contain the
AMTs followed by subordinate entries and an indication of
the type of formatting required.
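
The construction of subordinate entries can be illustrated with a short Python sketch. It is a simplification under assumptions, not the program described here: the function names, the use of plain word lists instead of title locator records, the omission of accession codes and type fields, and the sample title are all hypothetical. What it takes from the description is that main-term occurrences are replaced by `*` working from the right end of the title, that permuted entries are rotations beginning at each word not on the secondary stoplist, and that a non-permuted entry is the title rotated so the main term leads.

```python
def rotate(words, k):
    """Rotate a word list so that position k leads."""
    return words[k:] + words[:k]


def build_ases(title, amt, secondary_stoplist, permute):
    """Return the subordinate entries of one title for one actual main term."""
    words, term = title.split(), amt.split()
    n = len(term)
    hits = [i for i in range(len(words) - n + 1) if words[i:i + n] == term]
    if not hits:
        return []
    if not permute:
        # KWIC-type entry: the title rotated so that the main term leads.
        return [" ".join(rotate(words, hits[0]))]
    # DKWIC-type entries: replace each occurrence of the main term by '*',
    # working from the right so that earlier positions stay valid.
    for i in reversed(hits):
        if words[i:i + n] == term:
            words[i:i + n] = ["*"]
    entries = [" ".join(rotate(words, k)) for k, w in enumerate(words)
               if w != "*" and w not in secondary_stoplist]
    return sorted(entries)


title = "INFORMATION RETRIEVAL BY AUTOMATED INFORMATION SYSTEMS"
permuted = build_ases(title, "INFORMATION", {"BY"}, permute=True)
single = build_ases(title, "AUTOMATED INFORMATION", {"BY"}, permute=False)
```

Replacing from the right mirrors step 13 of Figure 7.20: positions recorded for occurrences further left are not shifted by the substitution.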

7.8.5. Printing The KWIC-DKWIC Hybrid Index

In the final phase of KWIC-DKWIC index generation the
sorted index-entry records are formatted for printing (see
Figure 7.21). The width and length of a printed page are at
the discretion of the user and are dynamically constructed
from parametric descriptions.



 1. SET UP LINE WIDTH & PAGESIZE,
    INITIALIZE SAVED AMT

 2. GET NEXT INDEX RECORD - NO MORE?

 3. IS THIS A DKWIC ENTRY

 4. DOES AMT MATCH SAVED AMT

 5. PRINT NEW AMT AND SAVE

 6. SET MARGIN KEY IF ASE BEGINS WITH *

 7. LOCATE END OF TITLE

 8. IS LENGTH OF ASE GREATER THAN LINE

 9. TRUNCATE TITLE ON RIGHT AND ADD
    TRUNCATION SYMBOL

10. EXPAND ASE TO LINEWIDTH, INSERT DOTS

11. CONFORM KWIC ENTRY TO LINEWIDTH

12. PRINT TITLE AND ACCESSION CODE



Figure 7.21 Flowchart describing the printing of
the final index



CHAPTER VIII. RESULTS, CONCLUSIONS, AND DIRECTIONS FOR
FUTURE RESEARCH

The capabilities of the double-KWIC coordinate indexing
technique have been discussed and illustrated in previous
chapters through isolated comparisons of index entries
prepared by DKWIC techniques and similar entries prepared by
other automated indexing schemes. In each of these
examples, the DKWIC entries demonstrated properties superior
to other KWIC index variants. In this chapter, I intend to
demonstrate that these properties are retained in a
KWIC-DKWIC hybrid index when certain selection criteria are
observed. The results from this study clearly indicate
roads for future improvements of the indexing system.

8.1. Influence of Various Parameters on Characteristics of
the Index, and Supporting Experimental Evidence

The success of automated main term selection lies in
the distribution of the words and word phrases found in the
collection of titles to be indexed. This distribution is
affected only by the vocabulary-normalizing functions which
merge words having common stems into a single group, and the
titles themselves which form the basis for the word patterns
counted. The stoplists, though extremely important for
determining index descriptors, dictate only which discrete
groups of the word distribution should be considered in the
indexing activity and which consecutive words of a title
should be chosen as main-term phrases. Consequently, the
stoplist affects the content of word groups but not their
distribution.

Two distinct parameters affect the specificity and the
format of main terms chosen from the word-phrase
distribution of terms. The posting thresholds determine
which main terms should be selected from groups of terms
having a common leading descriptor. The permutation
threshold independently acts to divide the distribution into
two groups, those main terms which will be posted with
permuted DKWIC subordinate entries, and those posted as
non-permuted KWIC entries.

Figure 8.1 illustrates the manner in which these two
parameters affect main terms through interactions with the
phrase distribution. The curve represents a rank ordering
of the occurrence frequencies of distinct descriptor
phrases. Experimental evidence has shown that this
distribution follows Zipf's law {Zipf,49}.

The posting thresholds, labeled "maximum posting" and
"minimum posting" in Figure 8.1, operate locally on
descriptor groups. Any member of the group which exceeds
the maximum posting threshold (e.g. terms A and AB in Figure
8.1) will be altered in favor of terms which fall between
the two posting limits (e.g. term ABC) while those falling
below these limits are entirely eliminated (e.g. term AC).
Because of the constraint of producing a covering index, the

Figure 8.1 A graph illustrating the influence of
minimum posting threshold, maximum posting
threshold, permutation threshold, and word
occurrence frequency on the selection of AMTs



terms which exceed the threshold are retained in a modified
form which excludes those entries covered by other terms of
the group. The modified group of terms is denoted in the
figure as A* and AB*. Thus, the maximum and minimum posting
thresholds modify the distribution of terms as well as the
selection of main terms for the final index.

The permutation threshold, when applied to the
distribution, disregards boundaries of maximal main term
groups and acts globally without concern for the decisions
made by the selection process. Only the occurrence
frequency is considered.
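
The separate roles of the two kinds of threshold can be summarized in a small Python sketch. It is an illustration only: the function names and the sample posting counts are hypothetical, and the treatment of counts that exactly equal a threshold is an assumption, since the text speaks only of terms that exceed, fall between, or fall below the limits.

```python
def posting_action(count, min_posting, max_posting):
    """Local decision for one term of a descriptor group."""
    if count > max_posting:
        return "modify"     # retained only in a modified form (e.g. A*)
    if count < min_posting:
        return "eliminate"  # covered by a less specific term instead
    return "select"         # falls between the two posting limits


def entry_format(count, permutation_threshold):
    """Global decision based on occurrence frequency alone.
    Treating a count equal to the threshold as DKWIC is an assumption."""
    return "DKWIC" if count >= permutation_threshold else "KWIC"


# Made-up posting counts for the group of Figure 8.1.
counts = {"A": 12, "AB": 7, "ABC": 3, "AC": 1}
actions = {term: posting_action(n, min_posting=2, max_posting=4)
           for term, n in counts.items()}
formats = {term: entry_format(n, permutation_threshold=5)
           for term, n in counts.items()}
```

With these invented counts the term ABC is selected by the posting limits yet formatted as a KWIC entry, which is the kind of interaction between the two thresholds that the discussion goes on to examine.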

Although the permutation threshold and the posting
thresholds are applied independently, their resulting
interaction can affect the quality of the final index. In
the example presented in Figure 8.1, the posting thresholds
led to choosing the main term ABC over term AB. The
resulting distribution placed the occurrence of these two
terms below the permutation threshold. The terms AB* and ABC
would have been formatted as KWIC entries and grouped
together in the index. Had the posting threshold parameters
either allowed the acceptance of term AB by raising the
maximum posting limit or rejected the term ABC by raising
the minimum posting limit, the entries grouped under the
term AB would have been selected in its original form for
the final index and would have been formatted with permuted
subordinate terms.

In order to further discuss these problems, some actual
data from an index generation will be examined. Figure 8.2
lists the general statistics concerning the title
collection. The titles of this data base were short
descriptive phrases containing an average of 7.3 words per
title of which an average of 2.9 words were deemed
372 titles
2702 words

39^^ ' primary stoplist^ Words ^ 

51 T\ . secondary stoplist words 

^96^ primary stoplist *words found in' titles 
1627 \ secondary stoplist words found in titles 
1075 distinct maximal main terms generated

270 specificity 1 MMTs

264 specificity 2 MMTs

541 specificity 3 MMTs
567 distinct PMT groups



Figure 8.2 Some general statistics concerning an
index generation



significant after the application of the stoplists.

Table 8.1 summarizes the number of main terms selected
at a particular specificity while varying the maximum and
minimum posting thresholds. As was anticipated from the
discussion concerning the posting threshold parameters, the
average specificity of terms increased as the maximum
posting threshold is decreased. This can be seen by reading
either down a column in the table, fixing the minimum
posting limit and decreasing the maximum, or by reading
diagonally down from right to left, fixing the difference
between the maximum and minimum posting thresholds while each
decreases by the same amount. To help clarify the
interpretation of each entry, consider, for example, the
quantities listed at maximum posting of 5 and minimum
posting of 3. This entry indicates that at least 3 titles
will be posted with each of the 13 terms at specificity 3,
that at least 73 - 13 or 60 terms at specificity 2 will

Table 8.1 A comparison of the number of main terms
generated at a particular specificity as posting
limits are varied.

Maximum                   Minimum Posting Threshold
Posting
Threshold               1      2      3      4      5      6

    6     # spec 1      ?      ?   1011      ?   1037   1047
          # spec 2    180     84     56      ?     38     28
          # spec 3     16     10      8      8      0      0
          avg spec   1.20   1.10   1.07      ?   1.04   1.03

    5     # spec 1    820    959    989   1025      ?
          # spec 2    229    101     73     37     33
          # spec 3     25     15     13     13      5
          avg spec   1.26   1.12   1.10   1.06      ?

    4     # spec 1    728    944    982   1021
          # spec 2    288    109     73     37
          # spec 3     39     22     20     17
          avg spec   1.35   1.14   1.11   1.07

    3     # spec 1    677    919    967
          # spec 2    327    119      ?
          # spec 3     71     37      ?
          avg spec   1.44   1.18   1.13

    2     # spec 1    556    887
          # spec 2    401    131
          # spec 3    118     57
          avg spec   1.60   1.23

    1     # spec 1    270
          # spec 2    264
          # spec 3    541
          avg spec   2.25


posted with at least 3 titles, and that at least 989 - 73 or
916 specificity 1 terms have fewer than 5 titles in common.
Therefore, to insure that the higher specificity terms are
not presented in the KWIC-type format in the final index,
the permutation threshold should not be greater than the
minimum posting limit.
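
The "avg spec" figures of Table 8.1 appear to be count-weighted means of the specificities; that reading is an assumption, but it reproduces the tabulated values. The Python sketch below checks it against the table entry that lists 982, 73, and 20 terms at specificities 1, 2, and 3 with an average of 1.11. The function name is hypothetical.

```python
def average_specificity(counts):
    """Count-weighted mean specificity.

    `counts` maps a specificity (1, 2, 3, ...) to the number of main
    terms selected at that specificity.
    """
    total = sum(counts.values())
    return sum(spec * n for spec, n in counts.items()) / total


cell = {1: 982, 2: 73, 3: 20}
avg = average_specificity(cell)  # (982 + 2*73 + 3*20) / 1075
```

The same computation applied to the entry listing 1021, 37, and 17 terms gives a value that rounds to the tabulated 1.07.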

Table 8.2 illustrates the size and the fraction of
DKWIC entries which were produced from the same title
collection for various maximum and minimum posting limits
when the permutation threshold assumes the value assigned
the minimum posting limit. The size of the index increases
through a maximum and then shrinks as one reads diagonally
down the table from right to left. At the higher extreme of
the posting limit values, the majority of the main terms
have specificity one, but do not occur at sufficient



Table 8.2 Index size and the percent DKWIC-type
entries of indexes prepared from the same titles
with various posting thresholds

Maximum           Minimum Posting and Permutation Threshold
Posting
Threshold            1      2      3      4      5      6

    6     lines   2078   1878   1746   1567   1461   1367
          DKWIC    76%    69%    62%    5?%    43%    35%

    5     lines   1997   1854   1691   1557   1461
          DKWIC    73%    67%    59%    50%    45%

    4     lines   1860   1826   1676   1557
          DKWIC    67%    66%    58%    50%

    3     lines   1746   1777   1672
          DKWIC    61%    64%    58%

    2     lines   1463   1700
          DKWIC    43%    63%

    1     lines   1339
          DKWIC    40%

frequency to surpass the permutation threshold. Thus, the
majority of the entries are formatted as KWIC entries and
the size of the index is small. At the lower extreme of the
posting limit values, the majority of the terms have higher
specificity since the maximum posting limit is small.
Again, however, the majority of the entries in the index are
KWIC entries since the occurrence frequency of high
specificity terms is below the permutation threshold limit.
I have found, through very subjective measures, that an
index in which about half of the entries are permuted DKWIC
entries and half are non-permuted KWIC entries appears to be
the most appealing. For this hybrid index, the ideal
parameters appear to be a minimum posting of 4 and a maximum
limit of either 6, 5, or 4. The parametric values of 4,4,
however, have the advantage of supplying the highest average
specificity for the least index size.

Recall that the permutation threshold was first
introduced to decrease the size of the fully permuted index.
Since indiscriminate use of the permutation threshold can
impair the quality of the index, further techniques must be
sought to independently control the index size.




8.2. Future Research And Possible Improvements In The DKWIC
Indexing Technique



Some areas of possible research and possible
improvements in the DKWIC indexing technique are discussed
in the next three subsections.



8.2.1. Actual Subordinate Entry Regulation

The effect of the DKWIC indexing technique on index
size has been cited as one of its major disadvantages when
compared with the KWIC indexing technique. The size
difference results from the construction of permuted DKWIC
subordinate entries. Many of these subordinate entries
could lead to false coordinations with the main term because
all remaining significant words in the title appear as
subordinate index terms regardless of the number of distinct
concepts found in a title. Reduction of the number of
possible false coordinations in the index entries should
improve the quality as well as reduce the size of the index
produced. In some DKWIC indexes which have been produced
{JCED,70, ASEE,71}, a high permutation threshold for the
construction of the higher-quality DKWIC-type entries has
been arbitrarily imposed because this parameter was the
primary determinant of the index size after the vocabulary
of the data source had been determined. Consequently, much
of the power of the DKWIC format was lost because of the
large number of non-permuted entries found in the index.

The reduction of the number of permuted subordinate
entries generated could be used as another size-determining
parameter. Furthermore, under this approach, the threshold
for constructing DKWIC-type entries could be set
significantly lower resulting in a higher-quality index of
greater depth for a given index size.



Several approaches to limit the permuted subordinate
entries appear possible. A manual subordinate entry
selection procedure could be implemented, but, as pointed
out earlier (section 7.2), this approach would place a
considerable burden on the index analyst who would be
responsible for examining each subordinate entry and
choosing those having relevant coordinations with the main
term. A good on-line text editing capability might
alleviate much of this burden, however.

Proximity relationships between the words in the titles
might afford a means of determining the more relevant
coordinations algorithmically. Several approaches which
would allow parameterized subordinate term selection based
on distance measurements about the extracted main term are
described below (see Figure 8.3 for examples).

1) Choose n significant words to the left and m 
significant words to the right of the extracted main 
term as relevant subordinate terms. 

2) Delimit the boundaries of subordinate term selection 
by the terminal punctuation surrounding the main term. 

3) Limit subordinate terms to all words up to and 
including the first type-one specificity unit to the 
left and to the right of the main term. 

4) Use some combination of the three measurement
criteria stated above.

Title

The Double-KWIC Coordinate Index. II. Use Of An
Automatically Generated Authority List To
Eliminate Scattering Caused By Some Singular And
Plural Main Index Terms



Actual Main Term

AUTHORITY LIST



Subordinate Entries (only first word of subordinate entry
shown)

1) choosing 2 significant words to the left and
right of the actual main term

AUTOMATICALLY        ELIMINATE
GENERATED            SCATTERING

2) choosing all significant words in the interval
containing the main term and bounded by terminal
punctuation

AUTOMATICALLY        CAUSED
ELIMINATE            GENERATED
INDEX                MAIN
PLURAL               SCATTERING
SINGULAR             TERMS

3) choosing all significant words up to and
including the next type 1 specificity unit to the
left and right (underlined above) of the main term

AUTOMATICALLY
ELIMINATE
GENERATED
INDEX
SCATTERING



Figure 8.3 Subordinate terms generated by applying
some word-proximity restrictions to ASE selection.
The words "AN", "BY", "II", "OF", "SOME", "THE",
"TO", and "USE" appear on the subordinate
stoplist.
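
The first of the proximity criteria (n significant words to the left and m to the right of the main term) is simple enough to sketch in Python. This is an illustrative sketch, not part of the indexing programs described here: the function names and the tokenizing pattern are assumptions, and the stoplist is the one quoted in the caption of Figure 8.3.

```python
import re

# Subordinate stoplist as quoted in the caption of Figure 8.3.
STOPLIST = {"AN", "BY", "II", "OF", "SOME", "THE", "TO", "USE"}


def significant_words(title):
    """Upper-case the title, split it into words, and drop stoplist words."""
    words = re.findall(r"[A-Z0-9-]+", title.upper())
    return [w for w in words if w not in STOPLIST]


def nearby_terms(title, main_term, n_left, n_right):
    """Criterion 1: the n_left significant words before and the n_right
    significant words after the first occurrence of the main term."""
    words = significant_words(title)
    term = main_term.upper().split()
    size = len(term)
    for i in range(len(words) - size + 1):
        if words[i:i + size] == term:
            left = words[max(0, i - n_left):i]
            right = words[i + size:i + size + n_right]
            return sorted(left + right)
    return []


TITLE = ("The Double-KWIC Coordinate Index. II. Use Of An Automatically "
         "Generated Authority List To Eliminate Scattering Caused By Some "
         "Singular And Plural Main Index Terms")
terms = nearby_terms(TITLE, "AUTHORITY LIST", 2, 2)
```

For the title of Figure 8.3 and the main term AUTHORITY LIST, the sketch yields the four subordinate terms listed under case 1 of the figure.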

Parameterized subordinate entry selection provides an
added dimension to the DKWIC generation process. By varying
the main term posting-permutation thresholds and the
subordinate entry parameters, a wider range of indexes could
be produced than could be realized by one or the other of
these parameters alone.

8.2.2. Automated Generation of "See" and "See Also" Cross
References

The automatic generation of "see" and "see also" cross
references could result from special treatment of some
stoplist entries. Consider an action which could be easily
performed when a particular word is found in the stoplist.
Linked to this word is a preferred index word (or phrase)
which would be added as an enrichment term to the title in
which the stoplist word was found. A marker indicating the
presence of the stoplist word in a source title would be
recorded. Processing of the enriched title would continue
normally with the stoplist word not participating as a type
one specificity unit. The preferred index word, having been
added to the title, would form a maximal main term and be
chosen as an actual main term during the selection process.
Each title containing the stoplist word or any other word
linked to the same preferred word would be handled
similarly. After all maximal main terms had been generated
for the source titles, the presence markers for all special
stoplist words would be interrogated. For each word that
was present in the source titles, a pseudo title would be
generated containing the stoplist word, the preferred word,
and "SEE" (see Figure 8.4). The stoplist disposition
indicators could be set to allow indexing to occur only for
the stoplist word. The normal mechanisms for generating the
index would produce a main term for the stoplist word with a
subordinate entry "see" reference pointing to the preferred
word entry. This procedure permits title directed "see"
referencing which can be a means of eliminating some
scattering in the index produced by the appearance of



Title

MAMOS: AN IBSYS SUBSYSTEM FOR PROGRAMMING LANGUAGE
EXPANSIONS.=

Index Terms

MAMOS SEE OPERATING SYSTEMS                   (pseudo title)

OPERATING SYSTEMS                       (preferred main term)

     MAMOS: AN IBSYS SUBSYSTEM FOR PROGRAMMING* ...

     .

     PROGRAMMING LANGUAGE EXPANSIONS / *
                                     (enrichment term added)



Figure 8.4 An illustration of a "see" cross
reference and the enriched title from which the
reference was generated
synonyms. With a slight modification, this procedure could
automatically add enrichment terms to titles and allow the
stoplist word to be indexed normally. This use would be of
only minor importance if other improvements are added as
explained later.
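
The "see" reference procedure outlined above can be sketched in Python as follows. This is a hypothetical illustration of the proposal, which the text presents as future work: the function name, the tokenizing pattern, the second sample title, and the representation of the stoplist links as a dictionary are all assumptions.

```python
import re


def add_see_references(titles, preferred):
    """Enrich titles and build pseudo titles for "see" references.

    `preferred` maps a special stoplist word to its preferred index term.
    Returns the enriched titles and one pseudo title for every special
    word that was found in at least one source title.
    """
    present = set()   # presence markers for the special stoplist words
    enriched = []
    for title in titles:
        extra = []
        for word in re.findall(r"[A-Z0-9]+", title.upper()):
            if word in preferred:
                present.add(word)
                if preferred[word] not in extra:
                    extra.append(preferred[word])
        enriched.append(" ".join([title] + extra))
    pseudo = [f"{word} SEE {preferred[word]}" for word in sorted(present)]
    return enriched, pseudo


titles = [
    "MAMOS: AN IBSYS SUBSYSTEM FOR PROGRAMMING LANGUAGE EXPANSIONS",
    "A SURVEY OF COMPILERS",
]
enriched, pseudo = add_see_references(titles, {"MAMOS": "OPERATING SYSTEMS"})
```

The pseudo titles would then pass through the normal index generation, with the stoplist disposition set so that only the stoplist word is indexed.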

Creation of "see also" cross references for synonymously
related terms could be performed in a manner similar to the
creation of "see" references. The index analyst would enter
related word groups which would be internally linked within
the stoplist. As words are located during maximal main term
generation, these related words would be marked present as
they appear in titles. After the MMTs have been generated,
the groups would be examined and pseudo "see also" titles
generated for members of groups having two or more words
marked present. The stoplist disposition of each of these
words would be set so that each word would be chosen as an
actual main term during later processing which would add
linking "see also" records to each subordinate group.

Some "see also" cross references could be generated
from statistics inherent in the main term selection process.
If a significant number of high-specificity terms are
selected from a PMT tree and entries for a less specific
antecedent main term are also selected or generated, then
"see also" cross references could be generated automatically
between the antecedent and descendant main terms.

8.2.3. Other Possible Index Refining Procedures

The distance measures employed in the earlier
discussions of subordinate term selection could be used in
another depth-increasing function. Assuming that authors
construct "good" titles and the information derived from
different segments of a title is interrelated, "related
terms" could be automatically generated from words and
phrases which lie outside the bounds of subordinate term
selection. A more detailed investigation of title
properties is necessary to demonstrate the feasibility of
this process.

A type of scattering occurs in DKWIC indexes which is a
result of multi-word main terms. This "structural
scattering" is demonstrated in Figure 8.5. The main terms
"INFORMATION RETRIEVAL" and "RETRIEVAL OF INFORMATION"
obviously refer to the same concepts but because the
indexing method treats collation differences as concept
differences, scattered entries are produced. If only the
significant words of a phrase were to be considered for main
term generation then structural scattering would disappear.
A marriage between Sharp's SLIC method (section 3.1.3) for
main-term formatting and DKWIC subordinate term selection
could result in a new product having the benefits of both
indexing techniques. However, the deletion of actual words
appearing in the title may be detrimental to the index's
ability to allow valid coordinate searches. More

INFORMATION RETRIEVAL

RETRIEVAL OF INFORMATION

Figure 8.5 An example of structural scattering
that occurs in double-KWIC coordinate indexes due
to the syntactic structure of natural language

investigation into these properties is necessary before any
conclusions can be reached.
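
One way to picture the removal of structural scattering is to collate main terms on their significant words alone. The Python sketch below does this by sorting the significant words of a phrase into a key; the sorting, the function name, and the one-word stoplist are assumptions made for illustration, and the sketch is not Sharp's SLIC method.

```python
def concept_key(phrase, stoplist):
    """Collation key built from the significant words of a main term,
    ignoring word order and stoplist words."""
    return tuple(sorted(w for w in phrase.upper().split()
                        if w not in stoplist))


STOP = {"OF"}
key_a = concept_key("INFORMATION RETRIEVAL", STOP)
key_b = concept_key("RETRIEVAL OF INFORMATION", STOP)
```

Both main terms of Figure 8.5 map to the same key, so their entries would collate together. The key also discards the function word and the original word order, which is the loss of title information cautioned against above.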

8.3. Concluding Remarks

In conclusion, I feel that the double-KWIC coordinate
indexing technique can be applied with fruitful results to
existing title or title-like phrase data bases. The
extensions of this new automatic indexing technique can only
lead to printed indexes of higher quality requiring only
minor expenditures of intellectual effort. Only through
wider application and field-testing of this technique and
through the dissemination of its products can the real worth
of these indexes be determined.

The author hopes to further improve the quality of
indexes produced by these techniques and hopes to have the
opportunity of continuing work along the lines mentioned
previously. It is expected that several of these aspects
will be investigated under continuing research performed by
the Department of Computer and Information Science, The Ohio
State University.

APPENDICES



APPENDIX A. ON COUNTING ENTRIES OF AN ARTICULATED SUBJECT
INDEX

Let us assume that an articulated title phrase may be
stylized by letters representing components separated by
function words. A phrase having four components (three
articulation points) would be written as

     abcd

A subject heading, extracted from the phrase, is a single
component; the modifiers may be represented by a canonical
notation by inserting a comma in the phrase at the point of
extraction

     b-a,cd

where b is a subject heading and the canonical modifier is
a,cd. All subject headings and modifiers of the phrase abcd
are

     a-,bcd
     b-a,cd
     c-ab,d
     d-abc,

If t<i,j> denotes the number of actual modifiers
produced from a canonical form modifier having i components
to the left and j components to the right of the comma, then
S<n> enumerates the entries of a title phrase having n
components:

     S<n+1> = (i=0,n) SUM (t<i,n-i>)

In order to evaluate S<n>, some relationships among the
t<i,j> must first be revealed. Define the first i
components of a canonical modifier as the initial phrase and
the last j components as the final phrase. Translating
Lynch's rules for the construction of index entries to
canonical representation:

1) if there is no initial phrase, the entry is
complete;

2) for each non-complete entry, subentries are formed
by:

     A) beginning with the last component of
     the initial phrase, generate i subheadings
     and canonical modifiers by extracting the
     last, the last two, ..., the last i-1, and
     the last i components;

     B) if the initial phrase exists, extract
     the first and only the first component of the
     final phrase as a subheading.

3) continue applying 1) and 2) until all entries are
complete.

The three rules given above recursively produce entries
from canonical modifiers. The process may be represented by
a tree structure with the terminal nodes representing the
actual index entries. The tree representing the canonical
decomposition of c-ab,d is:

                         c-ab,d
                            |
          +-----------------+-----------------+
          |                 |                 |
       c-b-a,d           c-ab-,d           c-d-ab,
          |                                   |
     +----+----+                         +----+----+
     |         |                         |         |
 c-b-a-,d   c-b-d-a,                 c-d-b-a,   c-d-ab-,
               |                         |
           c-b-d-a-,                 c-d-b-a-,

The terminal nodes represent the actual index entries
and are punctuated as follows:

1) delete the remaining comma from the terminal form

2) replace all dashes (-) with commas except when the
normal sequence of the phrase is retained (alphabetic
in the example).

It is evident from the construction scheme that:

a) t<0,m> = 1                       m >= 0 (rule 1)

b) t<n,0> = (i=0,n-1) SUM (t<i,0>)  n > 0 (rule 2a)

c) t<i,j> = t<i,j-1> + (k=0,i-1) SUM (t<k,j>)

                                    i,j > 0 (rule 2a and 2b)

Applying the first difference with respect to n in b), we
find

d) t<n+1,0> = 2t<n,0>               n > 0

Similarly, the first difference with respect to i applied to
c) yields

e) t<i+1,j> = t<i+1,j-1> - t<i,j-1> + 2t<i,j>
Let T(x,y) define the generating function for t<i,j>

T(x,y) = (i=0,inf) SUM ((j=0,inf) SUM
               (t<i,j> * (x**i) * (y**j)))
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kh'.Q recursion relation e) instructs the examipation of 
(yxy + 2x)T(x,y) 

= 2x ♦ T(x^y) - t<0,0> - t<0,1>*x 
= T (x^y) - 1 ♦ X 
Solving for T(Xry) yeilds - 



/ 



/ 




I T(x^y) - (l|x)/(1"2x>xy"y) 

A table of some of the coefficients of the terms of T(x,y)
is given in Table A.1.

Table A.1 The number of index entries generated
from a title having n initial phrases and m final
phrases

                          Initial Phrase
  Final
  Phrase     0     1     2     3     4     5      6      7

    0        1     1     2     4     8    16     32     64
    1        1     2     5    12    28    64    144    320
    2        1     3     9    25    66   168    416   1008
    3        1     4    14    44   129   360    968   2528
    4        1     5    20    70   225   681   1970   5500
    5        1     6    27   104   363  1182   3653  10836
    6        1     7    35   147   553  1925   6321  19825
    7        1     8    44   200   806  2984  10364  34232
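The entries of Table A.1 follow mechanically from relations a), b), and c). The short Python sketch below rebuilds the table that way; it is an illustration rather than part of the distributed system, and the function name `entry_counts` is a hypothetical helper introduced here. It assumes the summations in b) and c) stop at n-1 and i-1, which is the reading that reproduces the tabulated values.

```python
def entry_counts(max_i, max_j):
    """Return t[i][j], the number of index entries for i initial-phrase
    components and j final-phrase components, from relations a), b), c)."""
    t = [[0] * (max_j + 1) for _ in range(max_i + 1)]
    for j in range(max_j + 1):
        t[0][j] = 1                                   # a) no initial phrase
    for i in range(1, max_i + 1):
        t[i][0] = sum(t[k][0] for k in range(i))      # b) rule 2a only
        for j in range(1, max_j + 1):
            # c) rule 2b contributes t<i,j-1>, rule 2a the sum over k < i
            t[i][j] = t[i][j - 1] + sum(t[k][j] for k in range(i))
    return t


t = entry_counts(7, 7)
for j in range(8):                  # one printed row per final-phrase count
    print(j, [t[i][j] for i in range(8)])
```

Each printed row corresponds to one row of Table A.1, with the initial-phrase count running across the row.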



Recall that the total number of entries for a phrase,

     S<n+1> = (i=0,n) SUM (t<i,n-i>)

is represented as the sum of the diagonals of the matrix
above. This sum can be expressed in closed form by
rearranging some of the previous expressions as

     2t<i,j> - t<i,j-1> = t<i+1,j> - t<i+1,j-1>

and examining

     2S<n+1> - S<n> = 2*(i=0,n) SUM (t<i,n-i>)
                      - (i=0,n-1) SUM (t<i,n-1-i>)

     = 2t<n,0> + (i=0,n-1) SUM (2t<i,n-i> - t<i,n-1-i>)

and upon substitution of the recursion relation

     = 2t<n,0> + (i=0,n-1) SUM (t<i+1,n-i> - t<i+1,n-1-i>)

upon rearranging

     = S<n+2> - S<n+1> + 2t<n,0> - t<n+1,0> + t<0,n>
       - t<0,n+1>

Substituting a) and d), all terms involving t cancel. Thus,

     S<n+2> - 3S<n+1> + S<n> = 0

which can be easily solved.
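Because the recurrence is linear with constant coefficients, one standard route to a solution is through its characteristic equation. The block below is a sketch of that route and not the author's own derivation; the symbols r, A, B, and \varphi are introduced here for illustration, and the initial values S<1> = 1 and S<2> = 2 are taken from the diagonals of Table A.1.

```latex
\begin{aligned}
S_{n+2} - 3S_{n+1} + S_n &= 0, \qquad S_1 = 1,\quad S_2 = 2, \\
r^2 - 3r + 1 = 0 \;&\Longrightarrow\;
  r_{1,2} = \frac{3 \pm \sqrt{5}}{2} = \varphi^{\pm 2},
  \qquad \varphi = \frac{1 + \sqrt{5}}{2}, \\
S_n &= A\,\varphi^{2n} + B\,\varphi^{-2n},
  \qquad A = \frac{\varphi^{-1}}{\sqrt{5}},\quad B = \frac{\varphi}{\sqrt{5}}, \\
S_n &= \frac{\varphi^{2n-1} + \varphi^{-(2n-1)}}{\sqrt{5}} .
\end{aligned}
```

As a check, n = 1 gives (\varphi + \varphi^{-1})/\sqrt{5} = 1 and n = 2 gives (\varphi^{3} + \varphi^{-3})/\sqrt{5} = 2.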

Some values for S<n> are listed below:

     n     S<n>

     1       1
     2       2
     3       5
     4      13
     5      34
     6      89
     7     233
     8     610

Examining the recursion relation for a Fibonacci series

     F<i+2> = F<i+1> + F<i>

it is interesting to note that

     -F<i+2> + F<i+1> + F<i> +
      F<i+3> - F<i+2> - F<i+1> +
      F<i+4> - F<i+3> - F<i+2>

          = 0

and may be rewritten as

     F<i+4> - 3F<i+2> + F<i> = 0

Let i = 2n and the equation above represents S<n>, or S<n> =
F<2n>. Since S<0> is undefined and S<1> = 1, S<n> actually
is represented by F<2n-1>, with F<0> = 0 and F<1> = 1.
Consequently, S<n> is represented by the odd-indexed
elements of the natural Fibonacci sequence.
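The identification of S<n> with F<2n-1> can be checked numerically. The sketch below is illustrative only: it recomputes t<i,j> from relations a) through c) (with the sums taken to n-1 and i-1), forms the diagonal sums, and compares them with Fibonacci numbers. The names `diagonal_sums` and `fibonacci` are hypothetical helpers, not part of the indexing system.

```python
def diagonal_sums(count):
    """Return [S<1>, ..., S<count>], where S<d+1> is the sum of t<i,d-i>."""
    t = [[0] * count for _ in range(count)]
    for j in range(count):
        t[0][j] = 1                                   # relation a)
    for i in range(1, count):
        t[i][0] = sum(t[k][0] for k in range(i))      # relation b)
        for j in range(1, count):
            t[i][j] = t[i][j - 1] + sum(t[k][j] for k in range(i))  # c)
    return [sum(t[i][d - i] for i in range(d + 1)) for d in range(count)]


def fibonacci(count):
    """Return [F<0>, ..., F<count-1>] with F<0> = 0 and F<1> = 1."""
    f = [0, 1]
    while len(f) < count:
        f.append(f[-1] + f[-2])
    return f[:count]


S = diagonal_sums(8)
F = fibonacci(16)
for n, s in enumerate(S, start=1):
    print(n, s, s == F[2 * n - 1])    # compare S<n> with F<2n-1>
```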



APPENDIX B. ON ESTIMATING THE NUMBER OF ENTRIES OF A
            KWIC-DKWIC INDEX

     Because of the nature of DKWIC indexing principles, the
number of entries generated from a single title cannot be
estimated easily from a stylized model. Many global
characteristics which depend on the document collection
contribute to the number of entries generated from a single
title. For example, permuted subordinate entries are
generated only when the number of entries to be posted
beneath an actual main term exceeds a predefined threshold.
Although these attributes could be estimated through
probabilistic analysis, the distributions required are
difficult to obtain in full generality and depend heavily on
the titles being indexed.

     In view of these difficulties, the necessary
distributions are calculated as part of phase 2 of the
automatic selection process for generating DKWIC indexes.
When an exclusive PSE frequency marker is generated by the
auto-selection algorithm, the frequency is used to locate a
counter in an array of counters and increment its value.
After the selection process has operated on all MMT groups,
the resulting array represents the density of titles
collected by actual main terms.
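The counting step described above amounts to building a histogram indexed by frequency. A minimal Python sketch follows, assuming the frequencies arrive as a simple list of integers; the function name `title_density`, the `max_frequency` bound, and the choice to clamp larger frequencies into the last counter are all assumptions made for illustration, not details taken from the phase 2 program.

```python
def title_density(pse_frequencies, max_frequency):
    """Count how many frequency markers were generated for each frequency.

    counters[f] is incremented once for every marker carrying frequency f.
    Frequencies above max_frequency are clamped into the last counter
    (an assumption of this sketch).
    """
    counters = [0] * (max_frequency + 1)
    for frequency in pse_frequencies:
        counters[min(frequency, max_frequency)] += 1
    return counters


# Example: five markers with frequencies 1, 1, 3, 2, 1
density = title_density([1, 1, 3, 2, 1], max_frequency=5)
print(density)
```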
APPENDIX C. SYSTEM INSTALLATION AND EXECUTION INSTRUCTIONS
            FOR THE DOUBLE-KWIC COORDINATE INDEX
            SUBSYSTEMS

C.1 Form Of The Distributed Indexing Subsystems

     Two complete double-KWIC coordinate index subsystems
consisting of 14 data sets are distributed on 9-track,
OS-standard-labeled, 800 bpi tape with volume label DKWIC.
Both the KWOC-DKWIC and KWIC-DKWIC generators are included
as well as the supporting authority list generator and a
model data base interface subroutine. The first 10 data
sets contain the PL/I Version 5.2 source and OS/360 assembly
source for the indexing systems. The object and load
modules for the source programs are contained in unloaded
PDSs of files 11 and 12 respectively. File 13 contains some
useful JCL procedures which will aid the installation and
execution of the indexing systems. The last file is a copy
of this thesis in upper-lower case print form. The
characteristics of these data sets are described below.
          dsname      format    content

      1.  DKWIC.L1    FB        KWOC DKWIC source (PL/I)

      2.  DKWIC.L2    FB        Chemical Titles data base interface
                                subroutine source (PL/I)

      3.  DKWIC.L3    FB        word finder subroutine source
                                (360/BAL)

      4.  DKWIC.L4    FB        authority list generator source
                                (PL/I)

      5.  DKWIC.L5    FB        KWIC DKWIC monitor source (360/BAL)

      6.  DKWIC.L6    FB        phase 1 KWIC DKWIC - maximal main
                                term generator source (PL/I)

      7.  DKWIC.L7    FB        phase 2 KWIC DKWIC - actual main term
                                select source (PL/I)

      8.  DKWIC.L8    FB        phase 3 KWIC DKWIC - actual main term
                                modifier source (PL/I)

      9.  DKWIC.L9    FB        phase 4 KWIC DKWIC - actual
                                subordinate term generator source
                                (PL/I)

     10.  DKWIC.L10   FB        phase 5 KWIC DKWIC index print source
                                (PL/I)

     11.  DKWIC.L11   IEHMOVE   unloaded PDS of the 10 object modules
                                of the programs listed above. The
                                unloaded PDS name is DKWIC.OBJECT;
                                DCB=(RECFM=FB,LRECL=80,BLKSIZE=3200).
                                The partitions are named DKWIC1
                                through DKWIC10.

     12.  DKWIC.L12   IEHMOVE   unloaded PDS of the load modules for
                                the indexing subsystems. The
                                unloaded PDS name is DKWIC.INDEXLIB;
                                DCB=(RECFM=U,BLKSIZE=3400). When
                                loaded by IEHMOVE this data set can
                                be used as a STEPLIB for index
                                generation. The KWOC DKWIC
                                generator is named KWODKWIC, the
                                KWIC DKWIC generator is named
                                KWIDKWIC, and the authority list
                                generator is named AUTHLIST.

     13.  DKWIC.L13   FB        sample JCL for loading, compiling,
                                linking, and executing the DKWIC
                                subsystems

     14.  DKWIC.L14   FB        a copy of the print-line images of
                                this thesis in upper-lower case.
                                This data set should be printed with
                                a standard TN print train.
                                DCB=(RECFM=FB,LRECL=133,BLKSIZE=3458)

All source modules have characteristics

     DCB=(RECFM=FB,LRECL=80,BLKSIZE=800)

and can be updated with the IEBUPDTE utility.
C.2 Job Control Installation And Execution Aids

     With the exception of some added descriptive comments,
this section is a copy of data set DKWIC.L13. This data set
should be punched and used as an aid in installing the DKWIC
indexing subsystems. To punch this data set, the following
model may be used:

     //...      JOB
     //PCH      EXEC PGM=IEBGENER
     //SYSPRINT DD   SYSOUT=A
     //SYSUT1   DD   DSN=DKWIC.L13,UNIT=2400,DISP=OLD,
     //              LABEL=13,VOL=(,RETAIN,SER=DKWIC)
     //SYSUT2   DD   SYSOUT=B,DCB=BLKSIZE=80
     //SYSIN    DD   DUMMY

     The data set DKWIC.L13 contains job control language
procedures which are placed within a job stream or
optionally put in SYS1.PROCLIB. Several parameters are
provided to tailor the procedures to a particular
installation as noted below:

UNIT  - a direct access class such as 2311 or 2314.
        Default UNIT=2314.
LABEL - the label number of the data set on the distribution
        tape. Must be supplied where indicated.
SER   - a volume serial number of a direct access volume on
        which the object or load modules are to reside. Must
        be supplied where indicated.



     To compile a PL/I source DKWIC program:

     //DKWICOMP PROC
     //CMP      EXEC PGM=IEMAA,PARM='ATR,NEST,XREF'
     //SYSPRINT DD   SYSOUT=A
     //SYSLIN   DD   UNIT=SYSDA,SPACE=(TRK,(5,2)),
     //              DISP=(NEW,PASS),
     //              DCB=(RECFM=FB,LRECL=80,BLKSIZE=800)
     //SYSUT1   DD   UNIT=SYSDA,SPACE=(CYL,1)
     //SYSIN    DD   DSN=DKWIC.L&LABEL,UNIT=2400,
     //              DISP=OLD,LABEL=&LABEL,VOL=(,RETAIN,SER=DKWIC)
     //         PEND



     DKWICOMP compiles one of the PL/I source programs from
the distribution tape and places the object program on a
direct access device. This data set can be referenced by
DSN=*.stepname.CMP.SYSLIN. The program compiled depends
upon the LABEL parameter which must be supplied when the
procedure is called.

     To assemble a 360/BAL source DKWIC program:

     //DKWICASM PROC
     //CMP      EXEC PGM=IEUASM,PARM='NODECK,LOAD,XREF'
     //SYSPRINT DD   SYSOUT=A
     //SYSLIB   DD   DSN=SYS1.MACLIB,DISP=SHR
     //SYSGO    DD   UNIT=SYSDA,SPACE=(TRK,(5,2)),
     //              DISP=(NEW,PASS),
     //              DCB=(RECFM=FB,LRECL=80,BLKSIZE=800)
     //SYSUT1   DD   UNIT=SYSDA,SPACE=(CYL,1)
     //SYSUT2   DD   UNIT=SYSDA,SPACE=(CYL,1)
     //SYSUT3   DD   UNIT=SYSDA,SPACE=(CYL,1)
     //SYSIN    DD   DSN=DKWIC.L&LABEL,UNIT=2400,
     //              DISP=OLD,LABEL=&LABEL,VOL=(,RETAIN,SER=DKWIC)
     //         PEND



     DKWICASM assembles one of the 360/BAL source programs
from the distribution tape and places the object program on
a direct access device. This data set may be referenced by
DSN=*.stepname.CMP.SYSGO. The program assembled depends
upon the LABEL parameter which must be supplied when the
procedure is called.

     To load the object or load modules of the DKWIC
subsystems:

     //DKWICLD  PROC UNIT=2314
     //LOAD     EXEC PGM=IEHMOVE
     //SYSPRINT DD   SYSOUT=A
     //DD1      DD   UNIT=&UNIT,DISP=OLD,VOL=SER=&SER
     //DD2      DD   UNIT=2400,DISP=OLD,VOL=(,RETAIN,SER=DKWIC),
     //              DCB=(RECFM=FB,LRECL=80,BLKSIZE=800)
     //SYSUT1   DD   UNIT=&UNIT,DISP=OLD,VOL=SER=&SER
     //         PEND



     DKWICLD is a procedure skeleton which can be employed
to load the partitioned data sets containing either the
object or load modules to direct access storage. The SER
parameter is required and must specify the volume name of a
direct access volume. The UNIT parameter may be overridden
to supply the correct direct access storage class. A
LOAD.SYSIN dd statement must be supplied, followed by the
proper IEHMOVE commands for the data set to be loaded (see
section C.3).

     To link any of the object modules into load form:

     //DKWICLNK PROC UNIT=2314
     //LINK     EXEC PGM=IEWL,PARM=XREF
     //SYSPRINT DD   SYSOUT=A
     //SYSLMOD  DD   DSN=DKWIC.INDEXLIB,DISP=(NEW,KEEP),
     //              UNIT=&UNIT,SPACE=(TRK,(80,5,2)),VOL=SER=&SER,
     //              DCB=(RECFM=U,BLKSIZE=3400)
     //SYSUT1   DD   UNIT=SYSDA,SPACE=(CYL,2)
     //SYSLIB   DD   DSN=SYS1.PL1LIB,DISP=SHR
     //SYSLIB1  DD   DSN=DKWIC.OBJECT,DISP=OLD,
     //              UNIT=&UNIT,VOL=SER=&SER
     //         PEND



     DKWICLNK forms load modules from the object partitions
of the data set DKWIC.OBJECT and places them in
DKWIC.INDEXLIB. The SER parameter must specify the direct
access volume serial number of the previously created data
set DKWIC.OBJECT. The load modules will reside on this same
volume. The UNIT parameter may be overridden to provide the
correct direct access storage class. A LINK.SYSLIN dd
statement must be supplied followed by the proper linkage
editor control statements to link the desired object modules
from DKWIC.OBJECT (see section C.3).

     To execute the KWOC DKWIC generator:

     //KWODKWIC PROC UNIT=2314
     //DKWIC    EXEC PGM=KWODKWIC
     //STEPLIB  DD   DSN=DKWIC.INDEXLIB,DISP=SHR,UNIT=&UNIT,
     //              VOL=SER=&SER
     //SORTLIB  DD   DSN=SYS1.SORTLIB,DISP=SHR
     //SYSPRINT DD   SYSOUT=A
     //SYSOUT   DD   SYSOUT=A
     //SORTIN   DD   UNIT=SYSDA,SPACE=(CYL,(2,2)),
     //              DCB=(RECFM=VB,LRECL=&LRECL,BLKSIZE=&BLKSIZE)
     //SORTOUT  DD   UNIT=SYSDA,SPACE=(CYL,(2,2)),
     //              DCB=*.SORTIN
     //SORTWK01 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //SORTWK02 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //SORTWK03 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //SORTWK04 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //         PEND
     KWODKWIC calls the KWOC DKWIC generator into execution.
The SER parameter specifies the volume serial number of the
direct access volume containing DKWIC.INDEXLIB. The UNIT
parameter may be overridden to provide the correct direct
access storage class. A DKWIC.INPUT dd statement must be
supplied to indicate the source data to be indexed; a
DKWIC.SYSIN dd statement must be supplied to indicate the
location of stoplists; a DKWIC.SELECT dd statement locates
the actual main term selections; if an authority list is to
be used, a DKWIC.AUTHRL dd statement must specify its
location. The default parameters for the generation process
may be overridden by coding PARM.DKWIC='parameter list'
(see section C.4). The parameters LRECL and BLKSIZE must be
supplied and are described in section C.4.
     To execute the KWIC DKWIC generator:

     //KWIDKWIC PROC UNIT=2314
     //DKWIC    EXEC PGM=KWIDKWIC
     //STEPLIB  DD   DSN=DKWIC.INDEXLIB,DISP=SHR,UNIT=&UNIT,
     //              VOL=SER=&SER
     //SORTLIB  DD   DSN=SYS1.SORTLIB,DISP=SHR
     //SYSPRINT DD   SYSOUT=A
     //SYSOUT   DD   SYSOUT=A
     //PRIME    DD   UNIT=SYSDA,SPACE=(CYL,(2,2))
     //SECNDRY  DD   UNIT=SYSDA,SPACE=(CYL,(2,2))
     //SORTIN   DD   UNIT=SYSDA,SPACE=(CYL,(2,2))
     //SORTOUT  DD   UNIT=SYSDA,SPACE=(CYL,(2,2))
     //SORTWK01 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //SORTWK02 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //SORTWK03 DD   UNIT=SYSDA,SPACE=(CYL,3)
     //SORTWK04 DD   UNIT=SYSDA,SPACE=(CYL,2)
     //INDEX    DD   UNIT=SYSDA,SPACE=(CYL,(2,2))
     //MASTER   DD   SYSOUT=A
     //MARKS    DD   UNIT=SYSDA,SPACE=(CYL,(1,1))
     //         PEND
     KWIDKWIC calls the KWIC DKWIC generator into execution.
The SER parameter specifies the volume serial number of the
direct access volume containing DKWIC.INDEXLIB. The UNIT
parameter may be overridden to provide the correct direct
access storage class. A DKWIC.INPUT dd statement must be
supplied to indicate the source data base to be indexed; a
DKWIC.SYSIN dd statement locates the stoplists; if an
authority list is used, a DKWIC.AUTHRL dd statement points
to the data set containing the word control list. The
default execution time parameters for the index generation
process may be overridden by coding PARM.DKWIC='parameter
list' (see section C.5).

     To generate an authority list from a source data set to
be indexed:

     //AUTHRL   PROC UNIT=2314
     //DKWIC    EXEC PGM=AUTHLIST
     //STEPLIB  DD   DSN=DKWIC.INDEXLIB,DISP=SHR,UNIT=&UNIT,
     //              VOL=SER=&SER
     //SYSPRINT DD   SYSOUT=A
     //         PEND



     AUTHRL calls the authority list generator into
execution. The SER parameter specifies the volume serial
number of the direct access volume containing
DKWIC.INDEXLIB. The UNIT parameter may be overridden to
provide the correct direct access storage class. A
DKWIC.INPUT dd statement is required to indicate the source
data base to be indexed; a DKWIC.SYSIN dd statement locates
the authority list exception tables; a DKWIC.AUTHRL dd
statement identifies the location of the authority list to
be created; a DKWIC.TITLE dd statement specifies the data
set on which the data base, converted to internal form, is
placed. The default execution time parameters for the
authority list construction can be overridden by coding
PARM.DKWIC='parameter list' (see section C.6).
C.3 Installing The DKWIC Indexing Subsystems

     The simplest installation of the DKWIC indexing
subsystems is to use the load module provided on the
distribution tape. To install this system, the following
JCL model can be employed:

     //...        JOB
                  the JCL procedures of section C.2
     //MOVLIB     EXEC DKWICLD,SER=SYSLIB,UNIT=2314
     //LOAD.SYSIN DD   *
       COPY PDS=DKWIC.INDEXLIB,TO=2314=SYSLIB,FROM=2400=(DKWIC,12)



Assumptions:

     1) the direct access storage devices to be used are
        2314's (the blocking is such that 2311's may be
        substituted)

     2) the PDS DKWIC.INDEXLIB is placed on the volume named
        SYSLIB (change name as appropriate) and this volume has
        at least 80 tracks (in the case of 2314) of available
        space, and does not already contain a data set named
        DKWIC.INDEXLIB.

The SER and UNIT parameters and the TO=unit=ser should be
changed to those names used by the particular installation.
Once DKWIC.INDEXLIB has been loaded, the indexing
procedures of section C.2 can use this data set as a
step-library.

     Should any of the source modules be changed or a new
data base interface be written, some of the modules may
require recompilation and linkage editing. The first step
of this process should be loading the object partitioned
data set. The following JCL model can be employed:



     //...        JOB
                  the JCL procedures of section C.2
     //MOVOBJ     EXEC DKWICLD,SER=SYSLIB,UNIT=2314
     //LOAD.SYSIN DD   *
       COPY PDS=DKWIC.OBJECT,TO=2314=SYSLIB,FROM=2400=(DKWIC,11)
     /*
     //



Assumptions:

     1) the direct access storage devices to be used are
        2314's (the blocking is such that 2311's may be
        substituted)

     2) the PDS DKWIC.OBJECT is placed on the volume named
        SYSLIB (change name as appropriate) and this volume has
        at least 32 tracks (in the case of 2314) of available
        space, and does not already contain a data set named
        DKWIC.OBJECT.

The SER and UNIT parameters and the TO=unit=ser should
be changed to those names used by the particular
installation.

     In order to replace one of the members of the
DKWIC.OBJECT data set, the member to be replaced must first
be scratched and then added to the data set. The following
JCL model first scratches the members DKWIC3 and DKWIC4 and
recompiles them from the distribution tape:

     //...         JOB
                   the JCL procedures of section C.2
     //SCRATCH     EXEC PGM=IEHPROGM
     //SYSPRINT    DD   SYSOUT=A
     //DD1         DD   UNIT=2314,DISP=OLD,VOL=SER=SYSLIB
     //SYSIN       DD   *
       SCRATCH DSNAME=DKWIC.OBJECT,VOL=2314=SYSLIB,MEMBER=DKWIC3
       SCRATCH DSNAME=DKWIC.OBJECT,VOL=2314=SYSLIB,MEMBER=DKWIC4
     //ASM3        EXEC DKWICASM,LABEL=3
     //CMP.SYSGO   DD   DSN=DKWIC.OBJECT(DKWIC3),
     //                 DISP=(MOD,KEEP),UNIT=2314,VOL=SER=SYSLIB
     //COMP4       EXEC DKWICOMP,LABEL=4
     //CMP.SYSLIN  DD   DSN=DKWIC.OBJECT(DKWIC4),
     //                 DISP=(MOD,KEEP),UNIT=2314,VOL=SER=SYSLIB
     //

Assumptions:

     1) the direct access storage devices used are 2314's
        (the blocking is such that 2311's may be substituted)

     2) the PDS DKWIC.OBJECT exists on the volume named
        SYSLIB

     The relationships between the object and execution
forms of the programs are given below to direct the linkage
editing required.
DKWIC.INDEXLIB   DKWIC.OBJECT partition     description of
name             names required             load module

KWODKWIC         DKWIC1,DKWIC2,DKWIC3       KWOC DKWIC index
                                            generator

AUTHLIST         DKWIC4,DKWIC2,DKWIC3       authority list
                                            generator

KWIDKWIC         DKWIC5                     KWIC DKWIC index
                                            monitor

NEWDKWIC         DKWIC6,DKWIC2,DKWIC3       maximal main term
                                            generator

SELECT           DKWIC7                     actual main term
                                            selection

MASK             DKWIC8                     modify maximal main
                                            terms

MERGE            DKWIC9                     create actual
                                            subordinate terms

PRINT            DKWIC10                    print DKWIC index



     The following JCL model may be used to create part or
all of the data set DKWIC.INDEXLIB from object modules:

     //...         JOB
                   the JCL procedures from section C.2
     //LINKLIB     EXEC DKWICLNK,UNIT=2314,SER=SYSLIB
     //LINK.SYSLIN DD   *
       INCLUDE SYSLIB1(DKWIC1,DKWIC2,DKWIC3)
       NAME KWODKWIC(R)
       INCLUDE SYSLIB1(DKWIC4,DKWIC2,DKWIC3)
       NAME AUTHLIST(R)
       INCLUDE SYSLIB1(DKWIC5)
       NAME KWIDKWIC(R)
       INCLUDE SYSLIB1(DKWIC6,DKWIC2,DKWIC3)
       NAME NEWDKWIC(R)
       INCLUDE SYSLIB1(DKWIC7)
       NAME SELECT(R)
       INCLUDE SYSLIB1(DKWIC8)
       NAME MASK(R)
       INCLUDE SYSLIB1(DKWIC9)
       NAME MERGE(R)
       INCLUDE SYSLIB1(DKWIC10)
       NAME PRINT(R)
     /*
     //

Assumptions:

     1) the direct access storage devices used are 2314's
        (the blocking is such that 2311's may be substituted)

     2) the data set DKWIC.OBJECT exists on the volume named
        SYSLIB and all 10 members are present

     3) the data set DKWIC.INDEXLIB does not exist on the
        volume named SYSLIB but will be created by this job.

     If only a portion of the load modules are to be created,
only those particular INCLUDE and NAME statements need to be
retained. If DKWIC.INDEXLIB already exists, the SYSLMOD dd
statement of the procedure may be overridden by inserting
the following dd statement after the EXEC card:

     //LINK.SYSLMOD DD DSN=DKWIC.INDEXLIB,DISP=(MOD,KEEP),
     //             UNIT=2314,VOL=SER=SYSLIB

C.4 The KWOC-DKWIC Hybrid Index Generator - Documentation

     The KWOC-DKWIC index generator is divided into three
logical segments. The user has the freedom to select or
bypass either of the last two.

     The initialization phase is always executed where
variable length storage requirements are determined and
allocated. The stoplists and the authority list, if
present, are brought into core and sorted.

     If phase 1 is executed, all potential main terms are
generated from the source titles after the title words found
on the authority list have been replaced by appropriate
preferred words. The potential main term file is
alphabetically sorted and searched for identical potential
main terms. The PMT and its occurrence frequency are
printed during this phase in preparation for actual main
term selection which occurs in phase 2.

     If phase 2 is entered, the sorted potential main term
file and the associated statistics file must be available.
During this phase, the actual main terms are selected from
the PMT file by matching sequence numbers input through a
selections file. If no selections file is provided, all PMT
are chosen for the final index. As selections are being
processed, the PMT statistics file is interrogated to
determine when subordinate entries should be permuted. When
either all selections have been made or the PMT file is
exhausted, the final index is sorted first by the actual
main term then by the first words of each subordinate entry.
The sorted index records are then passed to a formatting
routine where the index is printed according to user
specifications.
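The phase 1 step of sorting the potential main terms and tallying identical ones can be pictured with the short Python sketch below. It is an illustration only, not the PL/I implementation: the function name `tally_potential_main_terms` is hypothetical, the terms are held in memory rather than in sort work files, and numbering the distinct terms from 1 is an assumption about how sequence numbers are assigned.

```python
from itertools import groupby


def tally_potential_main_terms(pmts):
    """Sort the potential main terms and count identical ones.

    Returns a list of (sequence number, term, occurrence frequency),
    one tuple per distinct term, in alphabetical order.
    """
    ordered = sorted(pmts)
    return [
        (sequence, term, len(list(group)))
        for sequence, (term, group) in enumerate(groupby(ordered), start=1)
    ]


listing = tally_potential_main_terms(
    ["INDEX", "COORDINATE INDEX", "INDEX", "AUTOMATIC INDEX"]
)
for sequence, term, frequency in listing:
    print(sequence, term, frequency)
```

A listing of this shape is what the index analyst would scan when choosing sequence numbers for phase 2.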

C.4.1 KWOC-DKWIC Execution Parameters

     To allow the index analyst maximum flexibility in
generating indexes, several parameters can be supplied
during execution to tailor the index generator to his
specific needs. All parameters are found in the PARM field
of the EXEC statement (see C.4.5 for exact placement). The
format of the parameter string is

     PARM='phases,delimiters,#terminal,lencode,maxchar,
           maxword,minpmt,maxpmt,lenpage,lenline,
           threshold,autostop,maxstoplen,maxstopwid,
           sortsize,firstpage,#columns'

or

     PARM=D

where

Phases - two digit number, NM, directing the program to
     execute the phases indicated;

     N=0 - bypass phase 1;

     N=1 - create potential main terms using temporary
           files. At the termination of phase 1, the
           collated potential main terms reside on the data
           set named by the ddname SORTOUT. The data set
           named by the ddname SORTWK01 contains the tally
           data printed with the potential main term list.
           These data sets will be destroyed if phase 2 is
           entered directly;

     N=2 - create the potential main terms, copying the
           files necessary for phase 2 onto permanent data
           sets. At the termination of phase 1, the
           potential main terms will reside in the data set
           named by the ddname SAVEFILE and the tally data
           concerning like potential main terms resides in
           the data set named by the ddname TEMPFILE;

     N=3 - perform the same function as N=1 except do not
           print the PMT list;

     M=0 - bypass phase 2;

     M=1 - perform main term selection from temporary files,
           destroying both potential main term and tally data
           sets in the process; create and print the final
           index;

     M=2 - perform main term selection from permanent files.
           Potential main terms are selected from the data
           set named by the ddname SAVEFILE in conjunction
           with the tally data set named by the ddname
           TEMPFILE. Create and print the final index;

     M=3 - perform the same function as M=1 except do not
           print the index but calculate the line estimates
           only;

     M=4 - perform the same function as M=2 except do not
           print the index but calculate the line estimates
           only;

     default 10.

Delimiters - varying length character string;
     the string of alphanumeric characters which make up
     both the terminal and non-terminal word delimiters;
     terminal characters precede non-terminal characters;
     default ' '.

#terminal - integer;
     the number of characters in the terminal delimiter set;
     default 0.

Lencode - integer;
     the number of characters in the accession number of the
     title data being processed;
     default 0.

Maxchar - integer;
     the maximum number of characters expected in a title
     phrase;
     default 256.

Maxword - integer;
     the maximum number of words expected per title;
     default 50.

Minpmt - integer;
     the fewest number of words in a potential main term;
     default 1.

Maxpmt - integer;
     the maximum number of words in a potential main term;
     default 1.

Lenpage - integer;
     the number of lines per page;
     default 60.

Lenline - integer;
     the number of characters per line; minimum 20, maximum
     132;
     default 132.

Threshold - integer;
     the maximum number of subordinate entries posted
     beneath a main term in the KWOC-type format;
     default 1.

Autostop - integer;
     the maximum number of characters in a word that is
     automatically assumed to belong to the secondary
     stoplist;
     default 2.

Maxstoplen - integer;
     the maximum number of locations to be reserved for both
     the primary and secondary stoplists;
     default 0.

Maxstopwid - integer;
     the maximum number of characters found in a stoplist
     word;
     default 0.

Sortsize - integer;
     the number of 1024 bytes of storage to be used for sort
     buffer area;
     default 20.

Firstpage - integer;
     the number of lines to be printed on the first page so
     that header information can be inserted; omit this
     parameter if the first page is to be handled in the
     same manner as others;

#columns - integer;
     the number of columns making up the first page; used in
     conjunction with firstpage to create a short first
     page;

     The second form of the PARM field permits parameters to
be read from the data set PARM. This data set must contain
the parameter string of the first form omitting the 'PARM='.

     The parameters found in the PARM field mentioned above
are distinguished only by their position in the parameter
string. If the default value of any parameter is accepted,
the user must indicate the omission by a comma; the commas
marking omitted parameters are not necessary when the
omissions fall to the right of the last parameter present in
the list. In the example below,

     PARM=',''.,/'',2,6,,,,,126'

the delimiters consist of '.,/' of which the first two are
terminal; the accession code length is 6; the page length is
126; all other parameters assume their default values. Note
that all character strings are enclosed in apostrophes; to
represent an apostrophe, two consecutive apostrophes must be
coded.
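The positional convention can be made concrete with a small parser. The Python sketch below is an illustration and not the generator's own parsing routine. It assumes the string has already been taken out of the JCL PARM field (so the outer apostrophes are gone and doubled apostrophes inside it have been reduced to single ones), and its `DEFAULTS` table simply restates the defaults as they are read from the parameter list above. The names `split_fields` and `parse_parm` are hypothetical.

```python
NAMES = ["phases", "delimiters", "terminal", "lencode", "maxchar",
         "maxword", "minpmt", "maxpmt", "lenpage", "lenline",
         "threshold", "autostop", "maxstoplen", "maxstopwid",
         "sortsize", "firstpage", "columns"]

# Defaults as read from the parameter list; firstpage and #columns have none.
DEFAULTS = {"phases": "10", "delimiters": " ", "terminal": 0, "lencode": 0,
            "maxchar": 256, "maxword": 50, "minpmt": 1, "maxpmt": 1,
            "lenpage": 60, "lenline": 132, "threshold": 1, "autostop": 2,
            "maxstoplen": 0, "maxstopwid": 0, "sortsize": 20}


def split_fields(text):
    """Split on commas, except commas inside an apostrophe-quoted string."""
    fields, current, quoted, i = [], [], False, 0
    while i < len(text):
        ch = text[i]
        if quoted:
            if ch == "'" and text[i + 1:i + 2] == "'":
                current.append("'")      # doubled apostrophe -> one apostrophe
                i += 1
            elif ch == "'":
                quoted = False           # closing apostrophe
            else:
                current.append(ch)
        elif ch == "'":
            quoted = True                # opening apostrophe
        elif ch == ",":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


def parse_parm(text):
    """Return a dict of parameter values; omitted positions keep defaults."""
    params = dict(DEFAULTS)
    for name, value in zip(NAMES, split_fields(text)):
        if value == "":
            continue                     # omitted: default stands
        params[name] = value if name in ("phases", "delimiters") else int(value)
    return params


example = parse_parm(",'.,/',2,6,,,,,126")
print(example["delimiters"], example["terminal"],
      example["lencode"], example["lenpage"])
```

Trailing parameters need no commas because `zip` stops at the last field supplied, which mirrors the rule stated in the text.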

C.4.2 Input Of Stoplists To The KWOC-DKWIC Index
      Generator

     Both the primary and secondary stoplists are input to
the program through the data set associated with the ddname
SYSIN. Any word input as a member of the secondary stoplist
is assumed also to reside on the primary stoplist. The
records of this file are assumed to be 80 characters in
length with one stoplist word per record. The format of a
stoplist record is shown in Figure C.1.

     +------+-------------------------+--------------------+
     | type | stoplist                |                    |
     | code | word                    |                    |
     +------+-------------------------+--------------------+
      1      3               maxstopwid                   80

     type code

       01   primary stoplist
       02   secondary stoplist

     FIGURE C.1 STOPLIST ENTRY FORMAT

     The type code, a two digit numeric, indicates the
stoplist into which the designated word is placed; code 01
indicates primary; code 02 indicates secondary. Immediately
following the type code, in the third byte of the record,
begins the stoplist word itself. The next maxstopwid
characters make up the stoplist word. If the word has fewer
characters than the maximum, then the word must be padded
with blanks. If the word is longer than the maximum
specified, only the first maxstopwid characters are used.
The number of stoplist records must not exceed the maximum
number specified in the PARM statement. If the maximum is
exceeded an error message is printed and processing
continues ignoring any remaining stoplist words. The
stoplist words may appear in any order. They are separated,
sorted, and displayed for verification.
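The record layout just described can be read with a few lines of code. The Python sketch below is illustrative only: `read_stoplists` is a hypothetical name, records are taken as strings rather than card images from SYSIN, and silently skipping records with an unrecognized type code is an assumption, since the text does not say how they are handled.

```python
def read_stoplists(records, maxstopwid, maxstoplen):
    """Separate 80-character stoplist records into sorted word lists.

    Columns 1-2 hold the type code (01 primary, 02 secondary); the word
    starts in column 3 and at most maxstopwid characters of it are used.
    Records beyond maxstoplen are ignored, as the text describes.
    """
    primary, secondary = set(), set()
    for record in records[:maxstoplen]:
        card = record.ljust(80)[:80]              # pad or cut to 80 columns
        code = card[0:2]
        word = card[2:2 + maxstopwid].rstrip()    # drop blank padding
        if code == "01":
            primary.add(word)
        elif code == "02":
            secondary.add(word)
            primary.add(word)    # secondary words also count as primary
    return sorted(primary), sorted(secondary)


primary_words, secondary_words = read_stoplists(
    ["01THE", "02OF", "01ANALYSIS"], maxstopwid=4, maxstoplen=10
)
print(primary_words, secondary_words)
```

With maxstopwid set to 4, the word ANALYSIS is cut to its first four characters, matching the truncation rule in the text.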



C.4.3 Selecting Actual Main Terms For A KWOC-DKWIC Index

     Phase 2 of DKWIC index generation requires the index
analyst to choose those main terms that are to appear in the
final index. From the output of phase 1, a list of sequence
numbers corresponding to the chosen main terms is prepared.
These sequence numbers are punched into cards in free format
(i.e. at least one blank between numbers) in ascending order
and presented for input in the data set identified by the
SELECT ddname. If this dd statement is omitted, all
potential main terms are selected.
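Because both the selections and the potential main term file are in ascending sequence order, the matching can be done in a single pass over each. The Python sketch below illustrates that idea under simplifying assumptions: the PMT file is represented as a list of (sequence number, term) pairs, and the names `read_selections` and `select_main_terms` are hypothetical, not routines of the distributed programs.

```python
def read_selections(cards):
    """Read free-format sequence numbers (blank separated) from card images."""
    numbers = [int(token) for card in cards for token in card.split()]
    if numbers != sorted(numbers):
        raise ValueError("sequence numbers must be in ascending order")
    return numbers


def select_main_terms(pmts, selections=None):
    """Return the terms whose sequence numbers appear in selections.

    pmts is a list of (sequence number, term) in ascending sequence order.
    When no selections are supplied, every potential main term is chosen.
    """
    if selections is None:
        return [term for _, term in pmts]
    chosen = []
    wanted = iter(selections)
    target = next(wanted, None)
    for sequence, term in pmts:
        while target is not None and target < sequence:
            target = next(wanted, None)   # skip numbers with no matching PMT
        if target is None:
            break                         # all selections have been made
        if sequence == target:
            chosen.append(term)
            target = next(wanted, None)
    return chosen


pmts = [(1, "AUTOMATIC INDEX"), (2, "COORDINATE INDEX"),
        (3, "INDEX"), (4, "RETRIEVAL")]
picked = select_main_terms(pmts, read_selections(["1 3", "  4"]))
print(picked)
```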

C.4.4 Job Control For A KWOC-DKWIC Index Generation

     Below is a list of all ddnames and the required
attributes of the data sets used by the program. Note that
several data sets may be optionally supplied.

ddname    usage

SYSPRINT  sequential output data set on which all messages
          and the final index are placed.

INPUT     sequential input data set on which resides the
          title data to be indexed.

AUTHRL    sequential input data set on which the authority
          list resides. This statement is present only when
          the authority list is used.

SYSIN     sequential input data set on which reside the
          communication record with the interfacing
          subroutine (see section C.7) and stoplists.

PARM      optional sequential input data set which contains
          the parameters for the index generation when
          PARM=D is specified.

SELECT    sequential input data set used during phase 2 to
          input the sequence numbers denoting the actual main
          terms. If this dd statement is omitted, then all
          potential main terms are selected if phase 2 is
          entered.

SAVEFILE  sequential data set on which the potential main
          terms are copied during phase 1 only when the
          first digit of the phase parameter is 2. This dd
          statement defines the input potential main term
          data set when the phase 2 option is set to 2 or 4.
          (LRECL=MAXCHAR+LENLINE/2+LENCODE+55,
          BLKSIZE=N*LRECL+4)

TEMPFILE  sequential data set on which the tally of like
          potential main terms is placed by phase 1 when
          the first digit of the phase parameter is set to
          2. During phase 2 this data set is used to input
          the tally information if the second digit of the
          phase parameter is set to 2 or 4.

SORTLIB   the system sort library. DSN=SYS1.SORTLIB,DISP=SHR

SORTIN    sequential data set which is used as a temporary
          input/output file during sorting; LRECL and
          BLKSIZE must be identical to SAVEFILE.

SORTOUT   temporary input/output data set used for sorting
          procedures. LRECL and BLKSIZE should be identical
          to SAVEFILE.

SYSOUT    sequential output message data set required for the
          SORT/MERGE program.

SORTWK0n  work areas for the sort routine (n = 1,2,3 minimum).

C.4.5 Sample JCL For A KWOC-DKWIC Index Generation



//... JOB
     the JCL procedures from section C.2
//GEN EXEC KWODKWIC,UNIT=2314,SER=SYSLIB,
//      PARM.DKWIC='parameter list',
//      LRECL=described above,
//      BLKSIZE=n*LRECL+4
//DKWIC.INPUT DD *
     title data to be indexed
//DKWIC.AUTHRL DD DSN=AUTHRL,DISP=OLD,
//      UNIT=2314,VOL=SER=SYSLIB
//DKWIC.SYSIN DD *
     interface control card
     stoplists
//DKWIC.SELECT DD *
     sequence numbers of selected entries

C.4.6 Messages Issued By The KWOC-DKWIC Index Subsystem

DKWIC.00 - VERSION cc - d

     PHASES                    dd
     DELIMITERS
        GROUP1                 cccc
        GROUP2                 cccc
     ACCESSION LENGTH          dd
     MAXIMUM TITLE (CHAR)      ddd
     MAXIMUM WORDS             ddd
     MIN PMT                   d      MAX PMT   d
     PAGE LENGTH               ddd
     PAGE WIDTH                ddd
     PERMUTATION THRESHOLD     d
     AUTOMATIC STOP            d
     STOPLIST
        WIDTH                  dd
        MAXLEN                 dd

     the parsing of the parameter field is displayed for
     verification.

DKWIC.01 - LINE WIDTH ERROR

     the lenline parameter was greater than 132 or less than
     20; the line width is set to 132 and processing
     continues.

DKWIC.02 - NUMBER GROUP1 CHARACTERS > SIZE OF DELIMITERS

     the number of characters found in the delimiter string
     was less than #terminals; all characters in the
     delimiter string are assumed to be terminal; processing
     continues.

DKWIC.03 - MIN NUMBER WORDS/MAIN TERM > MAX

     the minimum number of words specified to be in a
     potential main term is greater than the maximum
     specified; the minimum number is set to the maximum and
     processing continues.

DKWIC.04 - STOPLIST GREATER THAN LENGTH SPECIFIED

     the number of stoplist words found in the SYSIN data
     set was greater than the number expected. Only the
     first maxstoplen are considered.

DKWIC.05 - PROGRAM ERROR, ONCODE=dddd

     a terminal execution error has been found by the PL/I
     error handler. The ONCODE is listed and a PLIDUMP is
     initiated if a PLIDUMP dd card is present.

DKWIC.06 - TOO MANY CHARACTERS IN RECORD - dddd

     the number of characters in the title whose accession
     code is dddd is greater than maxchar. The title is
     ignored and processing continues.

DKWIC.07 - TOO MANY WORDS IN TITLE TO PROCESS - dddd

     the number of words in the title whose accession code
     is dddd is greater than maxword. The title is ignored
     and processing continues.

DKWIC.08 - SORT ERROR

     the SORT/MERGE program returned a condition code other
     than zero. The sort control cards are listed below
     this message. Consult the message data set SYSOUT for
     details concerning the error. Execution terminates.



DKWIC.10 - PHASE 1 RESULTS

     TITLES                    dddd
     WORDS                     dddd
     WORDS/TITLE               dddd
     1-STOPLIST                dddd
     2-STOPLIST                dddd
     TOTAL PMT                 dddd
     UNIQUE PMT                dddd
     TOTAL PMT/TITLE           dddd
     CHARACTERS/TITLE          dddd
     CHARACTERS/REM TITLE      dddd

     phase 1 has been completed and the results are posted
     for verification.



DKWIC.20 - PHASE 2 RESULTS

     ACTUAL MAIN TERMS         dddd   dd.dd
     PERMUTED TYPE
        # TITLES               dddd   dd.dd
        # ENTRIES              dddd   dd.dd
        # LINES                dddd   dd.dd
     KWOC-TYPE
        # TITLES               dddd   dd.dd
        # ENTRIES              dddd   dd.dd
        # LINES                dddd   dd.dd

     phase 2 has been completed and the results are
     displayed for inspection. The statistics are grouped
     by the type of entry; each entry is given as the raw
     number of occurrences and the percentage of occurrences
     in the final index.

DKWIC.30 - SIZE ESTIMATES - LINEWIDTH ddd - PAGE ddd

     TITLES/ENTRY   MAIN TERMS   EST KWOC   EST DKWIC
          d             dd          ddd        ddd
          d             dd          ddd        ddd

     The number of main terms (MAIN TERMS) having N titles
     (TITLES/ENTRY) is displayed along with an estimate of
     the number of lines in the index these entries will
     produce if the entry is formatted as a KWOC-type (EST
     KWOC) or DKWIC-type (EST DKWIC). The linewidth and
     pagesize are also printed for reference when making
     calculations of the number of pages of index.

C.4.7 KWOC-DKWIC Index Subsystem Implementation
      Restrictions

The KWOC DKWIC generator operates under the full OS/360
operating system. The program is written in PL/I version
5.2 and requires a minimum of 126K bytes of core to operate
effectively. If the stoplists and authority list become
exceedingly large, this minimum will not be sufficient. The
program directly calls the System/360 SORT/MERGE facility
to handle variable length record sorts.

C.5 The KWIC-DKWIC Hybrid Index

The KWIC DKWIC generator produces an index through the
execution of five phases, implemented as PL/I subprograms
called by an assembly language submonitor. Each of these
phases may be selected or bypassed under user control.

In the first step, all maximal main terms are generated
from the data base. The specificity of each MMT as well as
each specificity unit boundary is written with each record.
These records are tagged with an internal sequence number
which represents the relative record position of the title,
which is kept in internal format in another file. A data
set of pointer records is also generated for this title file;
it contains information to locate all words in the
corresponding title along with indications of stoplist
characteristics. The maximal main term file is then sorted
alphabetically and passed to the selection program.

The maximal main term file is passed sequentially by
the selection program, where MMT statistics are gathered and
the PMT tree is built for each MMT beginning with the same
initial word. After each tree is built, it is examined for
maximum and minimum posting criteria. At this time pointers
into the MMT file are created accompanying the actual
specificity and count of the number of titles containing the
actual main term.

The transformation of the maximal main terms to actual
main terms occurs in the next step, where the MMT file, the
specificity and occurrence files are passed in parallel.
Each maximal main term is reduced to the specificity
indicated by the corresponding pointers. The user supplied
subordinate permutation threshold is matched with the
frequency of occurrence of each main term and a marker
concerning this decision is placed in the actual main term
record before it is written on a main term file. The main
term file is then sorted by the internal title sequence
number.

The title and associated pointer files are read in
parallel, matching internal sequence numbers against those
present in the main term file. A match signifies the need
to form a subordinate entry from the corresponding title.
When the number of occurrences of this main term phrase
falls below the permutation threshold, the title is rotated
so the initial word of the main term entry appears as the
first word of a KWIC-type entry. When the threshold is
exceeded, all occurrences of the main term are extracted
from the title. Subordinate entries are generated beginning
with each word that remains in the title and is not a member
of the secondary stoplist. When all AMTs have been
processed, control passes to a program which sorts the main
and subordinate entries.
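
The threshold decision described above can be illustrated with a
short sketch. The Python fragment below is not taken from the PL/I
generator. The function name, the list-of-words title
representation, the wrap-around rotation of the DKWIC-type
subordinate entries, and the treatment of a count equal to the
threshold as KWIC-type (following the PERM definition in section
C.5.1) are all assumptions made for illustration.

```python
def subordinate_entries(title_words, main_term, occurrences,
                        threshold, secondary_stoplist):
    """Form the subordinate entries of one title under one main term.

    title_words and main_term are lists of words; occurrences is the
    number of titles containing the main term (hypothetical inputs).
    """
    n = len(main_term)
    # Locate the first occurrence of the main term phrase in the title.
    start = None
    for i in range(len(title_words) - n + 1):
        if title_words[i:i + n] == main_term:
            start = i
            break
    if start is None:
        return []

    if occurrences <= threshold:
        # KWIC-type: rotate the title so the initial word of the
        # main term becomes the first word of the entry.
        return [" ".join(title_words[start:] + title_words[:start])]

    # DKWIC-type: extract every occurrence of the main term ...
    remaining = []
    i = 0
    while i < len(title_words):
        if title_words[i:i + n] == main_term:
            i += n
        else:
            remaining.append(title_words[i])
            i += 1
    # ... then begin one entry at each remaining word that is not on
    # the secondary stoplist (wrap-around is an assumption).
    entries = []
    for j, word in enumerate(remaining):
        if word not in secondary_stoplist:
            entries.append(" ".join(remaining[j:] + remaining[:j]))
    return entries
```

With the title THEORY OF AUTOMATIC INDEXING and the main term
INDEXING, a low occurrence count gives a single rotated KWIC-type
entry, while a count above the threshold gives one subordinate
entry for each of THEORY and AUTOMATIC when OF is on the secondary
stoplist.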

The sorted entry file is then formatted by a print
routine which first examines the permutation marker to
determine whether a KWIC or DKWIC subordinate entry should be
used. The index entry is then printed according to user
specifications.

C.5.1 KWIC-DKWIC Execution Parameters

The execution of each of the phases of the KWIC DKWIC
generator is governed by an execution monitor written in
IBM/360 assembly language. This monitor accepts several
keyword parameters which supply the necessary variable
information for tailoring the programs to generate a
specific index. These parameters appear on the PARM field
of the EXEC statement invoking the DKWIC indexing program
and take the following form:



BRKLIST - varying length character string

The set of break characters to be used to discern word
boundaries in the titles being indexed. The first
character is used to delimit the remainder of the break
characters and can be any character not found in the
list. The set of terminal break characters must appear
first in the list followed by the non-terminal ones.
The break character delimiter separates these strings
as well as ends the non-terminal list. Thus, if #.:;
are terminal and /- are non-terminal, then the
breaklist is written as

Q#.:;Q/-Q

where the breaklist delimiter is Q. The breaklist is a
positional parameter and must appear first in the PARM
field. If the entire list is omitted, it must be
represented by a comma. Two successive breaklist
delimiters are interpreted as a null string. A blank is
automatically supplied to the user even when a breaklist
is specified.

Default QQ Q, denoting no terminal break characters with a
blank being the only non-terminal.
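
The breaklist convention can be sketched in a few lines of Python.
This is only an illustration of the rules stated above, not the
monitor's own parsing code; the function name and the choice to
raise an error on a malformed list are assumptions.

```python
def parse_breaklist(spec):
    """Split a breaklist into (terminal, non_terminal) character sets.

    The first character is the delimiter; the terminal characters
    follow it, then the delimiter, then the non-terminal characters,
    then a closing delimiter.
    """
    delimiter = spec[0]
    parts = spec[1:].split(delimiter)
    # A well-formed list yields: terminal, non-terminal, empty tail.
    if len(parts) != 3 or parts[2] != "":
        raise ValueError("malformed breaklist: %r" % spec)
    terminal, non_terminal = parts[0], parts[1]
    # A blank is always supplied, even when a breaklist is specified.
    if " " not in non_terminal:
        non_terminal += " "
    return terminal, non_terminal


print(parse_breaklist("Q#.:;Q/-Q"))
print(parse_breaklist("QQ Q"))
```

Two successive delimiters produce an empty terminal string, which
is how the default list expresses "no terminal break characters".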

CODE=lencode

the length of the accession code;

default CODE=0

SPEC=maxspec

the maximum specificity of a maximal main term;

default SPEC=3

STOP=(autostop,stopwidth,maxstoplen)

Autostop - the maximum number of characters
     automatically assumed to be members of the
     secondary stoplist

Stopwidth - the number of characters in the longest
     stoplist word

Maxstoplen - the maximum number of words expected on
     the stoplist

Default STOP=(2,0,0)

POST=(maxpost,minpost)

Maxpost - the maximum number of titles to be posted at
     a particular specificity

Minpost - the minimum number of titles to be posted at
     a particular specificity



Default P0ST=(4,,^2) 

PAGE=(linewidth,pagelength,reserved,numcol)

Linewidth - the number of characters per line

Pagelength - the number of lines per page

Reserved - the number of lines (full page width)
     reserved on the first page of the index. This
     parameter allows the user to print a short first
     page.

Numcol - the number of columns expected on the first
     page

Default PAGE=(132,60,0,0)

PERM=threshold

threshold - the maximum number of titles forming a
group of similar main term entries which will be posted
as KWIC entries in the final index.

Default PERM=2

FORM=(pages,chars/col,colsep,res,orig,min,max,wid,len)

The FORM parameter is used to specify automatic
formatting specifications. If this parameter is present,
the PERM and PAGE parameters need not be specified since
those parameters are calculated by the automatic formatting
routine.

Pages - the maximum acceptable number of pages allowed
     for the index. The numeric specified must include
     partial first and last pages.

Chars/col - the minimum acceptable number of characters
     per line per column in a printed entry in the
     final index. This numeric includes the number of
     characters in the accession code but does not
     include the number of blank characters between
     columns.

Colsep - the number of blank characters to be inserted
     between columns when the final index is prepared
     for photoreduction.

Res - the number of lines (full page width) to be
     reserved on the first page of the index. This
     parameter allows the user to print a short first
     page.

Orig - an integer between 0 and 100 which represents
     the minimum acceptable percent of original size
     for the final index.

Min - the minimum acceptable permutation threshold.

Max - the maximum acceptable permutation threshold.

Wid - the width of the field in 10ths of an inch onto
     which the photoreduced copy of the index is to be
     fitted.

Len - the length of the field in 10ths of an inch onto
     which the photoreduced copy of the index is to be
     fitted.

Default FORM=(0,50,5,0,60,2,20,75,100)

PHASE=execphase

an integer representing the phases to execute

      1 - phase 1
      2 - phase 2
      4 - phase 3
      8 - phase 4
     16 - phase 5

execphase is the sum of all, or any of, these quantities.
The phases are always executed in order.

Default PHASE=31

With the exception of BRKLIST, the parameters are
keyword oriented and can appear in any order. The multiple
arguments of keyword parameters are positional. If the
default values of these parameters are to be assumed, their
position must be indicated by a comma. For example, to
change just pagelength, the PAGE parameter is coded

PAGE=(,120)

The first two letters of any keyword can be used as
abbreviations of any of the parameters mentioned above.

If the parameter field is too large to fit onto the
EXEC card, substitute the word CARD for the parameter list.
The parameter field is then read from up to the first two
card images of the data set associated with the ddname PARM.
The parameters are punched in the same keyword format
described above, dropping the opening and closing
apostrophes.
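
The keyword conventions above (two-letter abbreviations, positional
arguments with commas marking omitted values, and the additive
PHASE code) can be sketched as follows. This Python fragment is an
illustration under stated assumptions rather than the assembly
language monitor: the function names are hypothetical, the
positional BRKLIST is assumed to have been removed already, and only
the keywords whose defaults are legible above are included.

```python
# Defaults as listed in section C.5.1 (POST and FORM omitted here).
DEFAULTS = {
    "CODE": [0],
    "SPEC": [3],
    "STOP": [2, 0, 0],
    "PAGE": [132, 60, 0, 0],
    "PERM": [2],
    "PHASE": [31],
}


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, ""
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def parse_parm(text):
    """Merge keyword parameters over the defaults."""
    values = {name: list(default) for name, default in DEFAULTS.items()}
    for item in split_top_level(text):
        key, _, raw = item.partition("=")
        key = key.strip().upper()
        # The first two letters of a keyword are enough to identify it.
        matches = [name for name in DEFAULTS if name[:2] == key[:2]]
        if not matches:
            raise ValueError("unknown keyword: %r" % key)
        name = matches[0]
        args = raw.strip().strip("()").split(",")
        for position, arg in enumerate(args):
            if arg.strip():          # an empty position keeps the default
                values[name][position] = int(arg)
    return values


def phases_to_run(execphase):
    """Decode the additive PHASE code into phase numbers, in order."""
    return [n for n in range(1, 6) if execphase & (1 << (n - 1))]
```

For instance, `parse_parm("PAGE=(,120),PH=3")` changes only the
page length and the phase code, and `phases_to_run(3)` then selects
phases 1 and 2.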

C.5.2 Input Of Stoplists To The KWIC-DKWIC Index
      Generator

The stoplists for the KWIC DKWIC generator are input in
the same manner and form as the KWOC DKWIC process (see
C.4.2).

C.5.3 Job Control For KWIC-DKWIC Index Generation
Below is a list of all ddnames and the required
attributes of the data sets used by the program. Note that
several data sets may be optionally supplied.

DDNAME    USAGE

SYSPRINT  sequential output message data set

SYSIN     sequential input data set from which the data-base
          interface control and stoplists are read
          (LRECL=80)

INPUT     sequential input data set from which the data base
          of titles is read

AUTHRL    optional sequential input data set on which resides
          the authority list created by the word
          transformation routine

PRIME     sequential data set on which the titles in internal
          format are placed for later reference
          (LRECL=304,BLKSIZE=3348,RECFM=VB)

SECNDRY   sequential data set on which pointers to all words
          found in the corresponding PRIME title record are
          placed for later use (LRECL=144,BLKSIZE=1440,
          RECFM=FB)

SORTIN    sequential data set which is used as input to the
          standard sort package. This data set is used by
          three of the phases for output, changing the
          RECFM, LRECL, and BLKSIZE characteristics each
          time. Do not specify DCB characteristics for this
          file.

SORTOUT   sequential data set which is used to hold the
          output from the sort program. This data set is
          used as input to four phases of the indexing
          operation and should not contain DCB
          characteristics.

SORTWK0n  sequential data sets defining sort work areas
          (n = 1, 2, 3 minimum). The statistics for the PMT
          tree are kept on one of these data sets.

SORTLIB   the sort library for the standard sort program.
          The index generator requires exits E15 and E35.

SYSOUT    sequential output message data set used by the sort
          routine

MASK      temporary data set used to hold the selection
          markers generated by the auto-select routine.

INDEX     sequential output data set onto which the final
          index is placed prior to formatting.

MASTER    sequential output data set onto which the final
          formatted index is placed.

PARM      optional input data set describing an alternate
          parameter list input stream

C.5.4 Sample JCL For KWIC-DKWIC Index Generation

//... JOB
     the JCL procedures of section C.2
//ADKWIC EXEC KWIDKWIC,
//      PARM.DKWIC='parameter list'
//DKWIC.SYSIN DD *
     interface control card
     stoplists
//DKWIC.INPUT DD *
     data base cards
//

C.5.5 Messages Issued By The KWIC-DKWIC Index
      Subsystem

DKWIC.00 - DKWIC INDEX - VERSION v - n

     BREAK CHARACTERS
        TYPE 1                 1111
        TYPE 2                 1111
     CODE LENGTH               nnn
     MAX SPECIFICITY           nnn
     AUTOMATIC STOP            nnn
     WIDTH                     nnn
     MAXLEN                    nnn

     An echo of the parameters input to phase 1 is
     presented for verification.

DKWIC.01 - STOPLIST GREATER THAN LENGTH SPECIFIED

     The maxlen parameter specified a number less than the
     total number of words presented for the entire
     stoplist. Execution continues with the first maxlen
     stoplist words.

DKWIC.02 - TITLE RECORD IGNORED; LENGTH EXCEEDS MAX

     A title record containing more than 300 characters
     including accession code has been found and is printed
     under this message. The title record is ignored and
     processing continues.

DKWIC.03 - TITLE RECORD IGNORED; MAX WORDS/TITLE EXCEEDED

     A title record containing more than 32 words has been
     detected and printed below this message. The title has
     been ignored and processing continues.

DKWIC.04 - PROGRAM ERROR, ONCODE = nnnn

     A serious error has occurred during the execution of
     the program. The condition is described by the oncode
     numeric. This message usually follows a more
     descriptive error indication printed by the PL/I error
     handler. In the event the error handler abnormally
     terminates, the error can be determined by consulting
     the PL/I Reference Manual for oncode conditions.

DKWIC.05 - MMT STATISTICS

     NUMBER OF TITLES          nnnn
     NUMBER OF WORDS           nnnn
     WORDS ON SEC STOP         nnnn
     WORDS ON PRIM STOP        nnnn
     1-ARY MAX MAIN TERMS      nnnn
     2-ARY MAX MAIN TERMS      nnnn

     The statistics for MMT generation are presented for the
     user. This message is printed during the final step of
     phase 1.

DKWIC.10 - SELECTION CRITERIA

     MAX POSTING               nn
     MIN POSTING               nn

     The maximum and minimum posting limits are displayed
     for user verification as phase 2 is entered.

DKWIC.12 - SELECTION STATISTICS

     PMT TREES                 nnnn
     1-ARY MT                  nnnn
     2-ARY MT                  nnnn

     The statistics for the selection phase are presented
     for the user. The number of PMT trees examined and the
     number of selections made at each MT specificity are
     displayed.



DKWIC.13 - INDEX SIZE ESTIMATES

     TITLE/GROUP    NUMBER   EST KWIC   EST DKWIC   TOTAL EST
     OR THRESHOLD   GROUPS   LINES      LINES       LINES
          1          nnn      nnn        nnn         nnnn
          2          nnn      nnn        nnn         nnnn

     An estimate of the size of the index to be printed is
     displayed. The number of main terms contained in
     precisely n titles is found in the nth entry under
     TITLE/GROUP. If the threshold is n, then the number of
     titles which will form DKWIC-type entries and KWIC-type
     entries are displayed beneath EST DKWIC LINES and EST
     KWIC LINES respectively. From the averages concerning
     AMT specificity, words/title, and secondary stoplist
     criteria, an estimate of the number of lines in the
     index is presented for each threshold value.

DKWIC.40 - DKWIC ENTRY LARGER THAN MAX

     An entry has been generated which exceeds the maximum
     record length. The record, displayed below this
     message, is ignored and processing continues. The
     number of characters in this record after the main term
     has been extracted must be shortened to be accepted.

C.5.6 KWIC-DKWIC Index Subsystem Implementation
      Restrictions

1) A maximum of 300 characters has been allocated for
any title of the data base and any index item temporarily
stored by the program. The program detects this condition
and ignores such records, informing the user of the action.

2) A single title cannot contain more than 32 words as
defined by the word delimiter set. The program detects this
condition and ignores such records, informing the user of the
action.

3) All maximal main terms are truncated to 50
characters without warning.

4) The program requires 126K bytes of core to execute
effectively. When large stoplists and authority lists are
used, 126K bytes may be inadequate.

5) The programs operate under full OS/360 and directly
call the system sort package for fixed and variable length
record sorts.

C.6 The Authority List Generator - Documentation

The word transformation routine is embodied in a
program separate from any indexing routines and is intended
to be executed as a preprocessor of the titles being
indexed. The inputs consist of the data base and
appropriate exception lists; the output, the authority list
ready to be used by the indexing routines.

C.6.1 Authority List Execution Parameters
To effect generality, several parameters regarding the
estimate of array and string sizes are made available to the
user so as not to limit the usefulness of the program. The
parameter list must be supplied on the EXEC card describing
the authority list generator. It is of the form:

PARM='LCODE,BRKLIST,LISTLEN'

where

LCODE - integer

length of the accession code of this data base

default 0

BRKLIST - character string

a list of the characters to be used as word delimiters

default • • ' ....

LISTLEN - integer

the maximum number of words expected on the authority
list.

default 100

The parameters found in the PARM field are
distinguished only by their position in the parameter
string. If the default value of any parameter is
accepted, the user must indicate the omission by a comma;
indicating the positions of omitted parameters is not
necessary when the omissions fall to the right of the last
parameter present in the list. Character strings included
in the PARM field must be enclosed by pairs of apostrophes.

C.6.2 Authority List Exceptions List Input

All exception lists are entered through the SYSIN data
set. Each exception list word is punched, one word per
card, following a two byte numeric list code (see Figure C.3
for code numbers and designations). The words must be
grouped by exception list code; the words within a single
exception list can be placed in any order (see Figure C.2).

The first record of the exception list holds two
positional parameters which direct storage allocation for
the lists. These parameters are:

MAXEXCEPT,WIDEXCEPT

where

MAXEXCEPT - integer > 0

maximum number of words expected for all exception lists

WIDEXCEPT - integer > 0

maximum number of characters expected in the longest
exception list word



+----------+------------------+-------------------------+
| LISTCODE | EXCEPTION WORD   |                         |
+----------+------------------+-------------------------+
 1          3       WIDEXCEPT+2                       80

Figure C.2 Exception list format



A review of the exception list definitions and their
assigned code numbers is displayed in Figure C.3.

01 - non-transformable words ending in "consonant-s"
     (e.g. physics, MEDLARS, etc.).

02 - non-transformable words ending in "vowel-s"
     (e.g. atlas, pathos, etc.) excluding those ending in
     "sis".

03 - non-transformable words ending in "ies" (e.g. series,
     etc.).

04 - irregular plurals ending in "es" whose singulars are
     not formed by dropping the final "s" (e.g. indices,
     etc.).

05 - corresponding singular entry for irregular plurals
     found on list 04 (e.g. index, etc.).

06 - transformable words ending in "sses" whose singulars
     are formed by dropping the final "ses" (e.g. busses,
     etc.).

07 - transformable words ending in "ses" whose singulars are
     formed by dropping the final "es" (e.g. thesauruses,
     choruses, etc.).

Figure C.3 A synopsis of the exception list codes
           and their definitions



C.6.3 Authority List Format

The authority list produced by this program is an array
of the singular and plural words transformed by the word
transformation routine. Each element of the array is in one
of two formats, regular preferred word and irregular
preferred word.

The 18 bytes of a regular preferred word entry contain
the singular or plural word which is used to match words in
the data base (see Figure C.4). When a match is found, the
preferred word is formed by concatenating the preferred word
stem (whose offset is given as a binary integer in the last
two bytes of the authority list entry) with an ending chosen
from an array whose subscript is stored in the "ending
indicator" byte (see Figure C.4).

+----------------------+------------------+-------------+
| SINGULAR/PLURAL WORD | ENDING INDICATOR | STEM OFFSET |
+----------------------+------------------+-------------+
 1                   15                               18

Figure C.4 Regular preferred word format



The "ending indicator" is a one byte binary integer
pointing into an "ending" array (see Figure C.5). The
preferred word for each entry of the authority list is
formed by concatenating the word stem with the appropriate
ending. If the word ACTIVITIES appeared in the data base,
both the words ACTIVITY and ACTIVITIES would appear in the
authority list. The "stem offset" of each entry would be 7
and the "ending indicator" would be 4. The preferred word
generated would be ACTIVITY(IES) for both the singular and
plural concept.

     ending indicator     ending

            1             (S)
            2             (ES)
            3             (SES)
            4             Y(IES)
            5             IS(ES)
            6             F(ES)

Figure C.5 Endings used to form preferred words

If the preferred word stem cannot be generated from the
singular or plural word, the "ending indicator" byte
contains an asterisk and the "stem offset" bytes are
interpreted as a subscript into the authority list pointing
to the preferred word. This irregular preferred word format
differs from the normal format in that a preferred word code
corresponding to the reinterpretation of the "stem offset"
precedes the replacement word. The preferred word code is
chosen so that upon sorting of the authority list words,
this record will be placed in a position corresponding to
this code. An irregularly formed preferred word is handled
in the same manner as a regular preferred word once the word
stem has been retrieved.

To indicate storage requirements to any program using
the authority list, the first record of the list contains in
free format the number of words in the list as well as the
number of characters in each record.
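
The regular preferred word format can be illustrated with a short
Python sketch that packs and decodes one 18 byte entry. It is an
illustration only, not the PL/I routine: the function names are
hypothetical, ASCII stands in for the System/360 character code,
the "stem offset" is read as the length of the stem (7 for
ACTIVIT), and the irregular (asterisk) format is not handled.

```python
import struct

# Ending array of Figure C.5, indexed by the ending indicator.
ENDINGS = {1: "(S)", 2: "(ES)", 3: "(SES)",
           4: "Y(IES)", 5: "IS(ES)", 6: "F(ES)"}

# 15 byte word, 1 byte ending indicator, 2 byte binary stem offset.
ENTRY = struct.Struct(">15sBH")


def pack_entry(word, ending_indicator, stem_offset):
    """Build one 18 byte regular preferred word entry."""
    padded = word[:15].ljust(15).encode("ascii")
    return ENTRY.pack(padded, ending_indicator, stem_offset)


def preferred_word(entry):
    """Concatenate the word stem with the indicated ending."""
    raw_word, ending_indicator, stem_offset = ENTRY.unpack(entry)
    word = raw_word.decode("ascii").rstrip()
    return word[:stem_offset] + ENDINGS[ending_indicator]


print(preferred_word(pack_entry("ACTIVITIES", 4, 7)))
print(preferred_word(pack_entry("ACTIVITY", 4, 7)))
```

Both the singular and the plural entry decode to the same preferred
word, which is what allows the two forms to be posted together.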

C.6.4 Job Control For The Authority List Generator
Below is a list of all ddnames and the required
attributes of the data sets used by the program. Note that
one data set may be optionally supplied.

DDNAME    USAGE

SYSPRINT  sequential output message data set

SYSIN     sequential input data set holding the data-base
          interface control card image, the exception list
          control image, and the exception lists (LRECL=80)

INPUT     sequential input data set holding the titles from
          which the authority list is built

AUTHRL    sequential output data set upon which the authority
          list is placed (LRECL=18,BLKSIZE=360)

TITLE     optional output data set on which the titles in
          internal format are placed

C.6.5 Sample JCL For The Authority List Generator

//... JOB
     the JCL procedures from section C.2
//ALIST EXEC AUTHRL,
//      UNIT=SYSLIB,
//      PARM.DKWIC='parameter list'
//DKWIC.AUTHRL DD DSN=&&AUTHRL,DISP=(NEW,PASS),
//      SPACE=(360,(10,10)),
//      DCB=(RECFM=FB,LRECL=18,BLKSIZE=360)
//DKWIC.INPUT DD *
     title data to be indexed
//DKWIC.SYSIN DD *
     interface control card
     exception list control card
     exception lists
//

C.6.6 Messages Issued By The Authority List Generator

DEPLRL.01 - NOT ENOUGH SPACE FOR EXCEPTION LISTS

     not enough space was estimated on the exception list
     control card for the exception lists input. The
     exception list entries which occur after overflow are
     ignored. Processing continues.

DEPLRL.02 - NOT ENOUGH SPACE FOR AUTHORITY LIST

     not enough space has been estimated for the authority
     list in the PARM statement. All singular entries
     marked with an asterisk (*) have not been added to the
     list.

DEPLRL.01 - THE AUTHORITY LIST REQUIRES ddd LOCATIONS

C.6.7 Authority List Subsystem Implementation
      Restrictions

1) The maximum number of words found in a title cannot
exceed 30. Unpredictable results may occur but processing
continues.

2) The maximum number of characters in a title is fixed
at 512. Unpredictable results can occur if this limit is
exceeded. Processing continues.

3) Authority list entries are restricted to 18 bytes.
The singular or plural word is truncated to 15 bytes without
warning.

C.7 Interfacing The Data Base

Each indexing subsystem requires that title data be
presented to it in a format that is easily manipulated by
the index generator. The task of converting external data
formats to the internal form used by the generator is
assumed by an externally compiled subroutine. Whenever data
in a new format requires indexing, only a new interface
subroutine is required.

Figure C.6 depicts the format into which all title data
must be converted. The first LENCODE bytes of the varying
length string contain the accession code for the title,
which follows immediately. No padding of the title string
is necessary. The maximum length of a record is defined for
each indexing subsystem.

+-----------+------------------------------------------+
| ACCESSION | TITLE                                    |
| CODE      |                                          |
+-----------+------------------------------------------+
 1    LENCODE

Figure C.6 Internal title format



C.7.1 Requirements Of An Interface Subroutine

To construct an interfacing subroutine, the following
conventions must be followed:

1. The subroutine operates as a PL/I function with the
following calling sequence and attributes:

     GETRECORD: PROCEDURE (BUFFER, LENCODE, POINTER)
                RETURNS (BIT(1));
        DECLARE
           BUFFER  CHAR(*) VAR,
           LENCODE FIXED BINARY(31),
           POINTER POINTER;

BUFFER  - character string to be returned containing the
          accession code and title in internal format.

LENCODE - fixed binary fullword informing the subroutine of
          the number of characters in the accession code.

POINTER - a pointer variable which upon return contains the
          address of the next record to be input by the
          interfacing subroutine.

2. The subroutine must use the ddname INPUT to acquire
the title data to be converted. The attributes of INPUT are
RECORD INPUT.

3. The first call to the subroutine is for
initialization purposes. Therefore, the subroutine must
have at least one variable in STATIC storage to indicate the
called state. During this call the subroutine may access
the first 80 bytes of the STREAM file SYSIN for any variant
information concerning the title format.

4. The subroutine returns a yes (RETURN('1'B)) when
BUFFER has been filled with a title. It returns no (RETURN
('0'B)) when no more records are available for processing.
C.7.2 Chemical Titles Interface Subroutine

An interfacing subroutine which converts the Chemical
Titles data format to internal form is included with the
indexing subsystems. This format was adopted by CHEMICAL
ABSTRACTS SERVICE and used for all pre-1971 Chemical Titles
source tapes. This subroutine handles titles coded in
either the pre-1971 standard file format or the results from
a Chemical Titles search.

The standard record contains 80 bytes (Figure C.7) of
which the first 17 bytes form the accession code. Column 18
is a type code which indicates how the remainder of the
information on the card image is to be interpreted. Within
each "type", the records are sequenced in column 19, the
"seq" field, beginning with sequence number 1. Type = 1
refers to author records, three authors per card. Type = 2
refers to title records. The title begins in column 21 of
the first card. If a second card is necessary, the title
must be broken on a word boundary and continued in column 23
of the next title card. Figure C.8 exemplifies a title in
this format.

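AUTHOR RECORD

+-----------+------+-----+--------------+---------------+--------------+
| ACCESSION | TYPE | SEQ | FIRST AUTHOR | SECOND AUTHOR | THIRD AUTHOR |
| CODE      |  1   |     |              |               |              |
+-----------+------+-----+--------------+---------------+--------------+
1           18     19    21             41              61            80

TITLE RECORD

+-----------+------+-----+-------------------------+
| ACCESSION | TYPE | SEQ | BEGINNING OF TITLE      |
| CODE      |  2   |  1  |                         |
+-----------+------+-----+-------------------------+
1           18     19    21                       80

TITLE CONTINUATION RECORD

+-----------+------+-----+---------+-----------------------+
| ACCESSION | TYPE | SEQ | IGNORED | CONTINUATION OF TITLE |
| CODE      |  2   |     |         |                       |
+-----------+------+-----+---------+-----------------------+
1           18     19    21        23                     80

TYPE
   1 - AUTHOR RECORD
   2 - TITLE RECORD

Figure C.7 Chemical Titles input format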

1              17   21                  41                                    80

CODEN0011        11 AUTHOR1             AUTHOR2             AUTHOR3
CODEN0011        12 AUTHOR4
CODEN0011        21 BEGINNING OF TITLE, NOTE THAT WHEN
CODEN0011        22   CONTINUED, THE TITLE IS BROKEN AT
CODEN0011        23   A WORD BOUNDARY.

Figure C.8 Example of a citation in Chemical Titles format
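
The card layout of Figures C.7 and C.8 can be sketched in Python as follows. This is an illustration under stated assumptions, not the PL/I interface subroutine itself: the function names are hypothetical, the sequence field is taken to be the single column 19, and title pieces are joined with one blank.

```python
def pad(card: str) -> str:
    """Force a card image to exactly 80 columns."""
    return card.ljust(80)[:80]


def title_text(card: str) -> str:
    """Return the title portion of a type 2 (title) card."""
    card = pad(card)
    if card[17] != "2":                      # column 18 holds the type code
        raise ValueError("not a title record")
    # The first title card starts in column 21, continuations in column 23.
    start = 20 if card[18] == "1" else 22    # column 19 holds the sequence
    return card[start:].rstrip()


def assemble_title(cards: list[str]) -> str:
    """Concatenate the title cards of one citation in sequence order."""
    title_cards = [pad(c) for c in cards if pad(c)[17] == "2"]
    title_cards.sort(key=lambda c: c[18])
    return " ".join(title_text(c) for c in title_cards)


acc = "CODEN0011".ljust(17)                  # columns 1-17: accession code
cards = [
    acc + "11" + " " + "AUTHOR1",
    acc + "21" + " " + "BEGINNING OF TITLE",
    acc + "22" + "   " + "CONTINUED ON A WORD BOUNDARY",
]
print(assemble_title(cards))
```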

The Chemical Titles answer format is very similar to
the one just described, with the exception of the addition of
a five byte question number preceding the standard form and
a five byte question weight following.

The interfacing subroutine is capable of merging any
record types into a record suitable for indexing. To
indicate to the subroutine which types to merge, a nonblank
character in the corresponding column of the first record in
the SYSIN data set indicates that that type is to be merged.
For instance, a character punched into columns 1 and 2 of
the first SYSIN record causes the subroutine to concatenate
the author and title record types. The first four columns
are recognized; types two, three, and four are handled
identically. A nonblank character in column five of this
same record indicates that the Chemical Titles answer format
is being used.

The interfacing subroutine replaces trailing blanks of
an input record with a single blank before the concatenation
of more records. Any blanks found in an author record are
replaced by X'FF'. In this manner, the entire author's name
and initials are treated as a single word by the indexing
routine. The scan of an author's name is terminated by the
occurrence of a contiguous pair of blanks.
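
The treatment of author names can be sketched as follows. The Python function protect_author is a hypothetical name, and the sketch assumes one author field is passed in at a time; the character "\xff" stands in for the byte X'FF'.

```python
FILL = "\xff"  # stands in for the byte X'FF' used by the subroutine


def protect_author(field: str) -> str:
    """Turn one author field into a single indexable word.

    The scan stops at the first contiguous pair of blanks; the blanks
    remaining inside the name are replaced by the fill character.
    """
    end = field.find("  ")
    name = field if end == -1 else field[:end]
    return name.rstrip(" ").replace(" ", FILL)


print(repr(protect_author("SMITH J A           ")))
```

Since the fill character is not a word delimiter, the word finder sees surname and initials as one word.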

C.8 Word Finder Subroutine

An assembly language routine has been implemented to
speed the process of finding words in phrases of arbitrary
lengths. The routine contains four entry points, three of
which are called by the PL/I main program to initialize
internal tables before successive calls to the fourth entry
yield the information for processing the string, word by
word.

The first entry, INITIAL, clears a 256 byte translate
table (TABLE) and must be called first by any program using
the routine.

   required declarations
      TABLE CHAR (256),
   calling sequence
      CALL INITIAL (TABLE);

The second entry loads the translate table cleared by
INITIAL with the word delimiters to be used. The user
supplies the delimiters (DELIMITERS) in a varying length
character string variable. A one byte character string
variable (TYPE) identifies the type of delimiter string
input. This character is inserted in the translate table
offset by the hexadecimal equivalent of each character in
the delimiter string.

   required declarations
      DELIMITERS CHAR (N) VAR,
      TYPE CHAR (1)
   calling sequence
      CALL SET (TABLE, DELIMITERS, TYPE);
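
The effect of INITIAL and SET on the translate table can be pictured with the following Python sketch. It is an analogy rather than the assembly language routine: the function names are hypothetical, and ASCII codes are assumed where the original would index the table by EBCDIC codes.

```python
def initial() -> bytearray:
    """Return a cleared 256-byte translate table (all entries zero)."""
    return bytearray(256)


def set_delimiters(table: bytearray, delimiters: bytes, dtype: int) -> None:
    """Store the delimiter type at the table offset of each delimiter."""
    for ch in delimiters:
        table[ch] = dtype


table = initial()
set_delimiters(table, b" ,", 1)   # type 1 delimiters: blank and comma
set_delimiters(table, b".;", 2)   # type 2 delimiters: period and semicolon
print(table[ord(" ")], table[ord(".")], table[ord("A")])
```

A zero entry marks a character that belongs to a word; any other value is the type of the delimiter at that offset.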

The third entry point is a means of saving some
execution time by bypassing some unnecessary dynamic loading
of parameter lists. This entry point is used to pass the
parameters concerning the word string to translate and the
arrays which contain the pointers to the words in this
string so that the fourth entry, which performs the word
finding operation, can be called without parameters.

   required declarations
      BUFFER CHAR (MAXCHARS),
      (BREAKTYPE, SECSTOP, PRISTOP) CHAR (MAXWORDS) VAR,
      (OFFSET, LENGTHWORD) (MAXWORDS) FIXED BINARY (31),
      STOPLIST (MAXSTOP) CHAR (MIDSTOP),
      (LSEC, LSTOP, AUTOSTOP) FIXED BINARY (31)
   calling sequence
      CALL SETVAR (BUFFER, TABLE, OFFSET, LENGTHWORD, BREAKTYPE,
                   SECSTOP, PRISTOP, STOPLIST, LSEC, LSTOP, AUTOSTOP);

where

BUFFER - location of the word string to translate.

OFFSET - OFFSET(I) contains the location of the first
   character of word I in the BUFFER string after
   translation.

LENGTHWORD - LENGTHWORD(I) contains the length of word I in
   the BUFFER string after translation.

BREAKTYPE - SUBSTR(BREAKTYPE,I,1) contains the largest
   delimiter type terminating word I in the BUFFER string
   after translation.

SECSTOP - SUBSTR(SECSTOP,I,1) contains a one (X'F1') if word
   I was found on the secondary stoplist; zero (X'F0')
   otherwise.

PRISTOP - SUBSTR(PRISTOP,I,1) contains a one (X'F1') if word
   I was found on the primary stoplist; zero (X'F0')
   otherwise.

STOPLIST - the location of the sorted stoplist. The
   secondary stoplist must be loaded first into the array
   followed by the primary stoplist.

LSEC - the actual number of words in the secondary stoplist
   (the first LSEC words of STOPLIST are assumed to hold
   the secondary stoplist).

LSTOP - the actual number of words in the stoplist.

AUTOSTOP - the upper limit of the number of characters to be
   found in a word which is automatically assumed to be on
   the secondary stoplist.

The word string to be translated must be moved to the
location BUFFER before translation can begin. The string is
unaffected by any translation process. The lengths of the
varying strings BREAKTYPE, SECSTOP, and PRISTOP reflect the
number of words found in the string BUFFER after
translation. To retrieve word I, the SUBSTR function is
used by the calling program:

   SUBSTR(BUFFER, OFFSET(I), LENGTHWORD(I))

This retrieves just the word with no terminating delimiters
attached.
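
The relationship among BUFFER, OFFSET, LENGTHWORD, and BREAKTYPE may be easier to see in a small Python sketch. It only imitates the behavior described above: the function names are hypothetical, ASCII is assumed in place of EBCDIC, offsets are kept 1-based to mirror SUBSTR, and a break type of zero is assumed for a word that ends the string.

```python
def make_table(delimiter_types: dict[int, bytes]) -> bytearray:
    """Build a 256-byte translate table from {type: delimiter bytes}."""
    table = bytearray(256)
    for dtype, chars in delimiter_types.items():
        for ch in chars:
            table[ch] = dtype
    return table


def find_words(buffer: bytes, table: bytearray):
    """Return parallel lists (offsets, lengths, breaktypes) for each word."""
    offsets, lengths, breaktypes = [], [], []
    i, n = 0, len(buffer)
    while i < n:
        while i < n and table[buffer[i]]:        # skip leading delimiters
            i += 1
        if i >= n:
            break
        start = i
        while i < n and not table[buffer[i]]:    # scan the word itself
            i += 1
        largest = 0
        j = i
        while j < n and table[buffer[j]]:        # largest terminating type
            largest = max(largest, table[buffer[j]])
            j += 1
        offsets.append(start + 1)                # 1-based, as for SUBSTR
        lengths.append(i - start)
        breaktypes.append(largest)
        i = j
    return offsets, lengths, breaktypes


def word(buffer: bytes, offsets, lengths, i: int) -> bytes:
    """Analogue of SUBSTR(BUFFER, OFFSET(I), LENGTHWORD(I)); i is 1-based."""
    start = offsets[i - 1] - 1
    return buffer[start:start + lengths[i - 1]]


table = make_table({1: b" ", 2: b","})
text = b"KWIC INDEX, DOUBLE"
offsets, lengths, breaktypes = find_words(text, table)
print(offsets, lengths, breaktypes, word(text, offsets, lengths, 2))
```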

The translation algorithm is equipped with a speedy
binary search which performs lookups in the array STOPLIST.
If the number of characters of a word does not exceed
AUTOSTOP, the corresponding locations of SECSTOP and PRISTOP
are both set to one. No lookups are performed if the number
of characters found in a word exceeds MIDSTOP, the number of
characters in each stoplist word. A word found on the
secondary stoplist causes the corresponding locations of
SECSTOP and PRISTOP to be set to one. Only after a failure
of the secondary stoplist search is the primary stoplist
searched. If no stoplist lookups are desired, substitute
any array for STOPLIST and a fullword binary zero for LSEC
and LSTOP. When LSEC is equal to LSTOP, only the first LSEC
STOPLIST words are searched.
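
The lookup rules just described can be summarized in a Python sketch. It is an illustration under assumptions, not the assembly routine: the function names are hypothetical, each of the two stoplist segments is assumed to be sorted on its own, and a word found only on the primary stoplist is assumed to set PRISTOP alone.

```python
from bisect import bisect_left


def _found(words: list[str], lo: int, hi: int, word: str) -> bool:
    """Binary search for `word` in the sorted slice words[lo:hi]."""
    i = bisect_left(words, word, lo, hi)
    return i < hi and words[i] == word


def classify(word, stoplist, lsec, lstop, autostop, midstop):
    """Return the (SECSTOP, PRISTOP) flags for one word."""
    if len(word) <= autostop:       # short words: automatically stopped
        return (1, 1)
    if len(word) > midstop:         # longer than any stoplist word: no lookup
        return (0, 0)
    if _found(stoplist, 0, lsec, word):       # secondary stoplist first
        return (1, 1)
    if _found(stoplist, lsec, lstop, word):   # then the primary stoplist
        return (0, 1)
    return (0, 0)


# Secondary stoplist first (LSEC = 2), followed by the primary stoplist.
stoplist = ["AND", "THE", "INDEX", "METHOD"]
for w in ("A", "THE", "INDEX", "KWIC", "COORDINATE"):
    print(w, classify(w, stoplist, 2, 4, autostop=1, midstop=8))
```

Setting both LSEC and LSTOP to zero makes both searched ranges empty, which corresponds to the case in which no stoplist lookups are desired.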

To initiate the translation of BUFFER, the fourth entry
point is used.

   calling sequence
      CALL FIND;




GLOSSARY

Abbreviations

AMT     actual main term
ASE     actual subordinate entry
DKWIC   double-KWIC
KWIC    key-word-in-context
KWOC    key-word-out-of-context
MMT     maximal main term
PMT     potential main term
PSE     potential subordinate entry
SLIC    selected listing in combination

Definitions

descriptor - a word or phrase describing a single concept

term - a combination of descriptors which describe a related
   collection of concepts

entry - a term and a means of locating a document containing
   the concepts described by the term

Notation

d<j>   the jth document descriptor
i<j>   the jth index descriptor

{k<1>,k<2>,...,k<n>}   a set of n descriptors

(i=1,n) SUM (f(i))   the summation over i of the function f
   having argument i

k<i> UNION k<j>   the union of elements or sets k<i> and k<j>

k<i> INTERSECT k<j>   the intersection of the elements or sets
   k<i> and k<j>

The following is a KWIC-DKWIC index of this thesis
prepared from the Table of Contents, List of Figures, and
List of Tables. The numeric accession codes indicate the
page on which the section heading or caption may be found.
Captions are distinguished from section headings by the
terminating letter F placed on the caption accession codes.

The index was generated by the KWIC-DKWIC subsystem
described in appendix C, section C.5. Below are listed the
index generation parameters and pertinent statistics for the
index to follow.



    142  phrases
  136<l  words
    120  primary stoplist words
     24  secondary stoplist words
    524  primary stoplist words found in titles
    759  secondary stoplist words found in titles
    605  distinct MMTs
    491  specificity 1 MMTs
    100  specificity 2 MMTs
     12  specificity 3 MMTs
    143  distinct PMT groups
      9  maximum posting threshold
         minimum posting threshold
      9  permutation threshold
   1.21  average PMT specificity

MAIN TERM(S)
   ACCESS TO MORE SPECIFIC CONCEPTS +ROVIDES IMMEDIATE  58F
   ACTUAL * +ENCE FREQUENCY DATA USED FOR SELECTION OF  74F
   ACTUAL * +BING THE TAILORING OF MMT RECORDS FORMING  128F
   ACTUAL * .......... SELECTION OF  122
   ACTUAL * (AMTS) AND KWOC-DKWIC THRESHOLD VALUES +OF  74
   ACTUAL * AND THE EXCLUSIVE PSE MARKERS PRODUCED BY +  126F
   ALGORITHM +SE MARKERS PRODUCED BY THE AMT SELECTION  126F
   AMT SELECTION ALGORITHM +SE MARKERS PRODUCED BY THE  126F
   * AMTS) AND KWOC-DKWIC THRESHOLD VALUES +ON OF ACTUAL  74
   APPLYING AN AUTOMATICALLY GENERATED AUTHORITY LIST +  88
   AUTHORITY LIST TO WORDS OF * (COMPARE FIGURE 6.2) +  88
   AUTOMATED * SELECTION PROCESS + LOGICAL FLOW FOR AN  114F
   AUTOMATED * SELECTIONS FOR THE PMT TREE OF FIGURE 7+  115F
   AUTOMATIC * SELECTIONS PERFORMED ON THE PMT TREE OF+  116F
   AUTOMATICALLY GENERATED AUTHORITY LIST TO WORDS OF +  88
   BALLOONING EFFECT IN THE PROTOTYPE DKWIC INDEX CAUS+  66F
   CAUSED BY PERMUTING SUBORDINATE ENTRIES UNDER * DER+  66F
   * COMPARE FIGURE 6.2) +TED AUTHORITY LIST TO WORDS OF  88
   COMPARISON OF THE NUMBER OF * GENERATED AT A PARTIC+  138F
   CONCEPTS +ROVIDES IMMEDIATE ACCESS TO MORE SPECIFIC  58F
   CONSISTING OF ALL PMTS WHICH BEGIN WITH THE SAME WO+  101F
   CRITERIA ON GENERATION OF POTENTIAL * AND +ELECTION  73F
   DATA USED FOR SELECTION OF ACTUAL * +ENCE FREQUENCY  74F
   DELIMITERS AND SELECTION CRITERIA ON GENERATION OF +  73F
   * DERIVED FROM ONLY A SINGLE TITLE +ATE ENTRIES UNDER  66F
   DKWIC INDEX .......... A THREE-WORD * OF A  59F
   DKWIC INDEX .......... SELECTING * FOR A KWOC  175
   DKWIC INDEX AS A RESULT OF APPLYING AN AUTOMATICALLY  88
   DKWIC INDEX CAUSED BY PERMUTING SUBORDINATE ENTRIES+  66F
   DKWIC THRESHOLD VALUES +OF ACTUAL * (AMTS) AND KWOC  74
   EFFECT IN THE PROTOTYPE DKWIC INDEX CAUSED BY PERMU+  66F
   EFFECT OF WORD DELIMITERS AND SELECTION CRITERIA ON+  73F
   ENTRIES UNDER * DERIVED FROM ONLY A SINGLE TITLE +E  66F
   EXCLUSIVE PSE MARKERS PRODUCED BY THE AMT SELECTION+  126F
   EXTRACTION OF POTENTIAL * (PMTS)  69
   FLOW FOR AN AUTOMATED * SELECTION PROCESS + LOGICAL  114F
   FLOWCHART DESCRIBING MAXIMAL * GENERATION ....  121F
   FLOWCHART DESCRIBING THE TAILORING OF MMT RECORDS F+  128F
   FORMATS OF THE ACTUAL * AND THE EXCLUSIVE PSE MARKE+  126F
   FREQUENCY DATA USED FOR SELECTION OF ACTUAL * +ENCE  74F
   * GENERATED AT A PARTICULAR SPECIFICITY AS POSTING LI+  138F
   GENERATED AUTHORITY LIST TO WORDS OF * (COMPARE FIG+  88
   * GENERATION .... FLOWCHART DESCRIBING MAXIMAL  121F
   GENERATION OF MAXIMAL *  119
   GENERATION OF POTENTIAL * AND +ELECTION CRITERIA ON  73F
   * GROUP CONSISTING OF ALL PMTS WHICH BEGIN WITH THE S+  101F
   HUMAN INTERFACE REQUIREMENTS FOR THE SELECTION OF A+  74
   INDEX .......... A THREE-WORD * OF A DKWIC  59F
   INDEX .......... SELECTING * FOR A KWOC DKWIC  175
   INDEX AS A RESULT OF APPLYING AN AUTOMATICALLY GENE+  88
   INDEX CAUSED BY PERMUTING SUBORDINATE ENTRIES UNDER+  66F
   INTERFACE REQUIREMENTS FOR THE SELECTION OF ACTUAL +  74
   KWOC DKWIC INDEX .......... SELECTING * FOR A  175
   KWOC-DKWIC THRESHOLD VALUES +OF ACTUAL * (AMTS) AND  74
   LIMITS ARE VARIED +ARTICULAR SPECIFICITY AS POSTING  138F
   LIST AND OCCURRENCE FREQUENCY DATA USED FOR SELECTI+  74F
   LIST TO WORDS OF * (COMPARE FIGURE 6.2) + AUTHORITY  88
   LOGICAL FLOW FOR AN AUTOMATED * SELECTION PROCESS +  114F
   MARKERS PRODUCED BY THE AMT SELECTION ALGORITHM +SE  126F
   MAXIMAL * .......... GENERATION OF  119
   MAXIMAL * (MMTS) AND SPECIFICITY UNITS  109
   MAXIMAL * FORMED FROM THE SPECIFICITY UNITS ILLUSTR+  111F
   MAXIMAL * GENERATION .... FLOWCHART DESCRIBING  121F
   MMT RECORDS FORMING ACTUAL * +BING THE TAILORING OF  128F
   * MMTS) AND SPECIFICITY UNITS .... MAXIMAL  109
   NUMBER OF * GENERATED AT A PARTICULAR SPECIFICITY A+  138F
   OCCURRENCE FREQUENCY DATA USED FOR SELECTION OF ACT+  74F
   PERMUTING SUBORDINATE ENTRIES UNDER * DERIVED FROM  66F
   PMT LIST AND OCCURRENCE FREQUENCY DATA USED FOR SEL+  74F
   PMT TREE OF FIGURE 7.3 +SELECTIONS PERFORMED ON THE  116F
   PMT TREE OF FIGURE 7.3 +OMATED * SELECTIONS FOR THE  115F
   PMTS WHICH BEGIN WITH THE SAME WORD (SEE TEXT) +ALL  101F
   * PMTS) .......... EXTRACTION OF POTENTIAL  69
   POSTING LIMITS ARE VARIED +ARTICULAR SPECIFICITY AS  138F
   POTENTIAL * (PMTS) .......... EXTRACTION OF  69
   POTENTIAL * AND +ELECTION CRITERIA ON GENERATION OF  73F
   POTENTIAL * GROUP CONSISTING OF ALL PMTS WHICH BEGI+  101F
   PROCESS + LOGICAL FLOW FOR AN AUTOMATED * SELECTION  114F
   PRODUCED BY THE AMT SELECTION ALGORITHM +SE MARKERS  126F
   PROTOTYPE DKWIC INDEX CAUSED BY PERMUTING SUBORDINA+  66F
   PSE MARKERS PRODUCED BY THE AMT SELECTION ALGORITHM+  126F
   RECORDS FORMING ACTUAL * +BING THE TAILORING OF MMT  128F
   REDUCED SCATTERING IN A DKWIC INDEX AS A RESULT OF +  88
   REQUIREMENTS FOR THE SELECTION OF ACTUAL * (AMTS) A+  74
   RESULT OF APPLYING AN AUTOMATICALLY GENERATED AUTHOR  88
   SCATTERING IN A DKWIC INDEX AS A RESULT OF APPLYING+  88
   SEE TEXT) +ALL PMTS WHICH BEGIN WITH THE SAME WORD  101F
   SELECTING * FOR A KWOC DKWIC INDEX  175
   SELECTION ALGORITHM +SE MARKERS PRODUCED BY THE AMT  126F
   SELECTION CRITERIA ON GENERATION OF POTENTIAL * AND+  73F
   SELECTION OF ACTUAL *  122
   SELECTION OF ACTUAL * +ENCE FREQUENCY DATA USED FOR  74F
   SELECTION OF ACTUAL * (AMTS) AND KWOC-DKWIC THRESHO+  74
   * SELECTION PROCESS +HE LOGICAL FLOW FOR AN AUTOMATED  114F
   * SELECTIONS FOR THE PMT TREE OF FIGURE 7.3 +UTOMATED  115F
   * SELECTIONS PERFORMED ON THE PMT TREE OF FIGURE 7.3  116F
   SIZE BALLOONING EFFECT IN THE PROTOTYPE DKWIC INDEX+  66F
   SPECIFIC CONCEPTS +ROVIDES IMMEDIATE ACCESS TO MORE  58F
   SPECIFICITY AS POSTING LIMITS ARE VARIED +ARTICULAR  138F
   SPECIFICITY UNITS .... MAXIMAL * (MMTS) AND  109
   SPECIFICITY UNITS ILLUSTRATED IN FIGURE 7.5 +OM THE  111F
   SUBORDINATE ENTRIES UNDER * DERIVED FROM ONLY A SIN+  66F
   SUMMARY OF AUTOMATIC * SELECTIONS PERFORMED ON THE +  116F
   TEXT) +ALL PMTS WHICH BEGIN WITH THE SAME WORD (SEE  101F
   THRESHOLD VALUES +OF ACTUAL * (AMTS) AND KWOC-DKWIC  74
   TITLE +E ENTRIES UNDER * DERIVED FROM ONLY A SINGLE  66F
   TRACE OF AUTOMATED * SELECTIONS FOR THE PMT TREE OF+  115F
   TREE OF FIGURE 7.3 +SELECTIONS PERFORMED ON THE PMT  116F
   TREE OF FIGURE 7.3 +OMATED * SELECTIONS FOR THE PMT  115F
   UNITS .... MAXIMAL * (MMTS) AND SPECIFICITY  109
   UNITS ILLUSTRATED IN FIGURE 7.5 +OM THE SPECIFICITY  111F
   VALUES +OF ACTUAL * (AMTS) AND KWOC-DKWIC THRESHOLD  74



   VARIED +ARTICULAR SPECIFICITY AS POSTING LIMITS ARE  138F
   WORD (SEE TEXT) +ALL PMTS WHICH BEGIN WITH THE SAME  101F
   WORD * OF A DKWIC INDEX .......... A THREE-  59F
   WORD * WHICH PROVIDES IMMEDIATE ACCESS TO MORE SPEC+  58F
   WORD DELIMITERS AND SELECTION CRITERIA ON GENERATIO+  73F
   WORDS OF * (COMPARE FIGURE 6.2) + AUTHORITY LIST TO  88

MAXIMAL MAIN TERM GENERATION .... FLOWCHART DESCRIBING  121F
MAXIMAL MAIN TERMS .......... GENERATION OF  119
MAXIMAL MAIN TERMS (MMTS) AND SPECIFICITY UNITS  109
MAXIMAL MAIN TERMS FORMED FROM THE SPECIFICITY UNITS I+  111F
MAXIMUM POSTING THRESHOLD, PERMUTATION THRESHOLD, AND +  134F
MESSAGE(S) ISSUED BY THE AUTHORITY LIST GENERATOR  196
MESSAGE(S) ISSUED BY THE KWIC DKWIC INDEX SUBSYSTEM ..  187
MESSAGE(S) ISSUED BY THE KWOC DKWIC INDEX SUBSYSTEM ..  177
MINIMUM POSTING THRESHOLD, MAXIMUM POSTING THRESHOLD, +  134F
MMT(S) FILE AND AMT MARKER FILE +TION OF AMTS FROM THE  127
MMT(S) GROUP +NG THE CONSTRUCTION OF A PMT TREE FROM A  124F
MMT(S) GROUP ILLUSTRATED IN FIGURE 7.4 +FORMAT FOR THE  123F
MMT(S) GROUP IN FIGURE 7.4 +TED IN FIGURE 7.2 FROM THE  113F
MMT(S) GROUP OF FIGURE 7.4 +LECTION ALGORITHM FROM THE  127F
MMT(S) RECORDS FORMING ACTUAL MAIN TERMS +TAILORING OF  128F
MMT(S)) AND SPECIFICITY UNITS .... MAXIMAL MAIN TERMS (  109
MODIFIED SYSTEM DESIGN: PRODUCTION OF KWOC-DKWIC HYBRI+  68
NATURAL LANGUAGE +ES DUE TO THE SYNTACTIC STRUCTURE OF  147F
NATURAL LANGUAGE INDEXING .... VOCABULARY CONTROL FOR  11
NODE(S) +TS (P) AND EXCLUSIVE PSE SETS (Z) FOR ALL THE  107F
NORMALIZATION IN A PANDEX INDEX COLLATING PREFERRED WO+  91F
OCCURRENCE FREQUENCY DATA USED FOR SELECTION OF ACTUAL+  74F
OCCURRENCE FREQUENCY ON THE SELECTION OF AMTS +ND WORD  134F
OCCURRENCE OF SINGULAR AND PLURAL WORD FORMS +E TO THE  80F
ORDERING OF A SINGLE SECONDARY CONCEPT FOR EACH TITLE  52F
OVERRIDE COMMANDS NECESSARY TO FORM THE AMT SELECTIONS+  113F
PANDEX INDEX  36
PANDEX INDEX .......... A PORTION OF A  38F
PANDEX INDEX COLLATING PREFERRED WORDS BUT DOES NOT AL+  91F
PANDEX INDEX FOR THE SAME TITLES OF FIGURE 4.1 ILLUSTR+  52F
PARAMETER(S) .......... AUTHORITY LIST EXECUTION  190
PARAMETER(S) .......... KWIC DKWIC EXECUTION  181
PARAMETER(S) .......... KWOC DKWIC EXECUTION  169
PARAMETER(S) ON CHARACTERISTICS OF THE INDEX AND SUPPO+  132
PERMUTATION THRESHOLD, AND WORD OCCURRENCE FREQUENCY O+  134F
PERMUTED ENTRIES OF INDEXES PREPARED FROM THE SAME TIT+  137F
PERMUTED KEYWORD INDEX .......... COMPLETELY  22
PERMUTED SUBORDINATE + PROTOTYPE DKWIC INDEX CAUSED BY  67F
PERMUTERM INDEX  26
PERMUTERM INDEX .......... A PORTION OF A  28F
PERMUTING SUBORDINATE ENTRIES UNDER MAIN TERMS DERIVED+  66F
PLURAL WORD FORMS +E TO THE OCCURRENCE OF SINGULAR AND  80F
PLURAL-SINGULAR STEMMING-RECODING ALGORITHM  84
PLURAL-SINGULAR STEMMING-RECODING ALGORITHM +ED BY THE  87F
PMT(S)
   ACTUAL MAIN TERMS +UENCY DATA USED FOR SELECTION OF  74F
   ALGORITHMS +E * GENERATION PROCESS ON AMT SELECTION  105
   AMT SELECTION ALGORITHMS +E * GENERATION PROCESS ON  105
   AMT TREE CHOSEN FROM THE * GROUP OF FIGURE 7.1 .... AN  102F
   AUTOMATED MAIN TERM SELECTIONS FOR THE * TREE OF FI+  115F
   AUTOMATIC MAIN TERM SELECTIONS PERFORMED ON THE * T+  116F
   CONSISTING OF ALL * WHICH BEGIN WITH THE SAME WORD +  101F
   CONSTRUCTION OF A * TREE FROM A MMT GROUP +BING THE  124F
   DATA USED FOR SELECTION OF ACTUAL MAIN TERMS +UENCY  74F
   EXCLUSIVE PSE SETS (Z) FOR ALL THE NODES +S (P) AND  107F
   EXTRACTION OF POTENTIAL MAIN TERMS (*)  69
   FLOWCHART DESCRIBING THE CONSTRUCTION OF A * TREE F+  124F
   FORMAT FOR THE MMT GROUP ILLUSTRATED IN FIGURE 7.4  123F
   FREQUENCY DATA USED FOR SELECTION OF ACTUAL MAIN TE+  74F
   * GENERATION PROCESS ON AMT SELECTION ALGORITHMS +THE  105
   GROUP +BING THE CONSTRUCTION OF A * TREE FROM A MMT  124F
   GROUP CONSISTING OF ALL * WHICH BEGIN WITH THE SAME+  101F
   GROUP ILLUSTRATED IN FIGURE 7.4 +FORMAT FOR THE MMT  123F
   * GROUP OF FIGURE 7.1 .... AN AMT TREE CHOSEN FROM THE  102F
   * GROUP OF FIGURE 7.1 +AL * STATISTICS, Z<T>, FOR THE  108F
   * GROUP OF FIGURE 7.1 SHOWING VALUES FOR TOTAL PSE SE+  107F
   INFLUENCE OF THE * GENERATION PROCESS ON AMT SELECT+  105
   LINEARIZED * TREE FORMAT FOR THE MMT GROUP ILLUSTRA+  123F
   * LIST AND OCCURRENCE FREQUENCY DATA USED FOR SELECTI+  74F
   MAIN TERM GROUP CONSISTING OF ALL * WHICH BEGIN WIT+  101F
   MAIN TERM SELECTIONS FOR THE * TREE OF FIGURE 7.3 +  115F
   MAIN TERM SELECTIONS PERFORMED ON THE * TREE OF FIG+  116F
   MAIN TERMS +UENCY DATA USED FOR SELECTION OF ACTUAL  74F
   MAIN TERMS (*) .... EXTRACTION OF POTENTIAL  69
   MMT GROUP +BING THE CONSTRUCTION OF A * TREE FROM A  124F
   MMT GROUP ILLUSTRATED IN FIGURE 7.4 +FORMAT FOR THE  123F
   NODES +S (P) AND EXCLUSIVE PSE SETS (Z) FOR ALL THE  107F
   OCCURRENCE FREQUENCY DATA USED FOR SELECTION OF ACT+  74F
   POTENTIAL MAIN TERM GROUP CONSISTING OF ALL * WHICH+  101F
   POTENTIAL MAIN TERMS (*) .... EXTRACTION OF  69
   PROCESS ON AMT SELECTION ALGORITHMS +E * GENERATION  105
   PSE SETS (P) AND EXCLUSIVE PSE SETS (Z) FOR ALL THE+  107F
   PSE SETS (Z) FOR ALL THE NODES +S (P) AND EXCLUSIVE  107F
   SEE TEXT) +OF ALL * WHICH BEGIN WITH THE SAME WORD  101F
   SELECTION ALGORITHMS +E * GENERATION PROCESS ON AMT  105
   SELECTION OF ACTUAL MAIN TERMS +UENCY DATA USED FOR  74F
   SELECTIONS FOR THE * TREE OF FIGURE 7.3 + MAIN TERM  115F
   SELECTIONS PERFORMED ON THE * TREE OF FIGURE 7.3 +M  116F
   SETS (P) AND EXCLUSIVE PSE SETS (Z) FOR ALL THE NOD+  107F
   SETS (Z) FOR ALL THE NODES +S (P) AND EXCLUSIVE PSE  107F
   * STATISTICS, Z<T>, FOR THE * GROUP OF FIGURE 7.1 +AL  108F
   SUMMARY OF AUTOMATIC MAIN TERM SELECTIONS PERFORMED+  116F
   TERM GROUP CONSISTING OF ALL * WHICH BEGIN WITH THE+  101F
   TERM SELECTIONS FOR THE * TREE OF FIGURE 7.3 + MAIN  115F
   TERM SELECTIONS PERFORMED ON THE * TREE OF FIGURE 7+  116F
PMT(S) (CONT)

TERMINAL-* STATISTICS, Z<T>, FOR THE * GROUP OF FIG+ 108F
TERMS +UENCY DATA USED FOR SELECTION OF ACTUAL MAIN 74F

TERMS (*) EXTRACTION OF POTENTIAL MAIN 69

TEXT) +OF ALL * WHICH BEGIN WITH THE SAME WORD (SEE 101F

TRACE OF AUTOMATED MAIN TERM SELECTIONS FOR THE * T+ 115F

TREE CHOSEN FROM THE * GROUP OF FIGURE 7.1 AN AMT 102F

* TREE FOR THE * GROUP OF FIGURE 7.1 SHOWING VALUES F+ 107F

* TREE FORMAT FOR THE MMT GROUP ILLUSTRATED IN FIGURE+ 123F

* TREE FROM A MMT GROUP +RIBING THE CONSTRUCTION OF A 124F

* TREE OF FIGURE 7.3 +ERM SELECTIONS PERFORMED ON THE 116F

* TREE OF FIGURE 7.3 +ED MAIN TERM SELECTIONS FOR THE 115F
VALUES FOR TOTAL PSE SETS (P) AND EXCLUSIVE PSE SET+ 107F

WORD (SEE TEXT) +OF ALL * WHICH BEGIN WITH THE SAME 101F

Z<T>, FOR THE * GROUP OF FIGURE 7.1 + * STATISTICS, 108F

POSTING LIMITS ARE VARIED +A PARTICULAR SPECIFICITY AS 138F

POSTING THRESHOLD, MAXIMUM POSTING THRESHOLD, PERMUTAT+ 134F

POSTING THRESHOLD, PERMUTATION THRESHOLD, AND WORD OCC+ 134F

POSTING THRESHOLDS + FROM THE SAME TITLES WITH VARIOUS 137F

POTENTIAL MAIN TERM GROUP CONSISTING OF ALL PMTS WHICH+ 101F

POTENTIAL MAIN TERMS (PMTS) EXTRACTION OF 69

POTENTIAL MAIN TERMS AND +ON CRITERIA ON GENERATION OF 73F

POTENTIAL SUBORDINATE ENTRY) SETS +TING EXCLUSIVE PSE 106

PREFERRED WORDS BUT DOES NOT ALTER THE ORIGINAL TEXT + 9^^

PRINTED INDEXES STEMMING AND RECODING FOR 83 

PROTOTYPE 

ANNOTATED DESCRIPTION OF THE * DOUBLE-KWIC COORDINA+ 55F

BALLOONING EFFECT IN THE * DKWIC INDEX CAUSED BY PE+ 67F

BALLOONING EFFECT IN THE * DKWIC INDEX CAUSED BY PE+ 66F

CAUSED BY PERMUTED SUBORDINATE +N THE * DKWIC INDEX 67F

CAUSED BY PERMUTING SUBORDINATE ENTRIES UNDER MAIN + 66F

CONSTRUCTION OF THE * DOUBLE-KWIC COORDINATE INDEX + 54F

COORDINATE INDEX THE * DOUBLE-KWIC (DKWIC) 46

COORDINATE INDEX (DKWIC) ENTRIES +THE * DOUBLE-KWIC 54F

COORDINATE INDEX DISPLAY FORMAT + THE * DOUBLE-KWIC 55F

DERIVED FROM ONLY A SINGLE TITLE + UNDER MAIN TERMS 66F

DESCRIPTION OF THE * DOUBLE-KWIC COORDINATE INDEX D+ 55F

DESIGN * SYSTEM 62

DESIGN FOR CREATING THE * DKWIC INDEX SYSTEM 64F

DISPLAY FORMAT + THE * DOUBLE-KWIC COORDINATE INDEX 55F

DKWIC HYBRID INDEX +ATION OF THE * SYSTEM: THE KWOC 66

* DKWIC INDEX SYSTEM DESIGN FOR CREATING THE 64F

* DKWIC INDEX CAUSED BY PERMUTED SUBORDINATE + IN THE 67F

* DKWIC INDEX CAUSED BY PERMUTING SUBORDINATE ENTRIES+ 66F

* DKWIC INDEX ILLUSTRATING SCATTERING DUE TO THE OCCU+ 80F

DKWIC) COORDINATE INDEX THE * DOUBLE-KWIC ( 46

DKWIC) ENTRIES +THE * DOUBLE-KWIC COORDINATE INDEX 54F

* DOUBLE-KWIC (DKWIC) COORDINATE INDEX THE 46

* DOUBLE-KWIC COORDINATE INDEX (DKWIC) ENTRIES +F THE 54F

* DOUBLE-KWIC COORDINATE INDEX DISPLAY FORMAT +OF THE 55F
EFFECT AND SIZE BALLOONING EFFECT IN THE * DKWIC IN+ 67F
PROTOTYPE (CONT)

EFFECT IN THE * DKWIC INDEX CAUSED BY PERMUTED SUBO+ 67F

EFFECT IN THE * DKWIC INDEX CAUSED BY PERMUTING SUB+ 66F

ENTRIES +THE * DOUBLE-KWIC COORDINATE INDEX (DKWIC) 54F

ENTRIES UNDER MAIN TERMS DERIVED FROM ONLY A SINGLE+ 66F

EVALUATION AND MODIFICATION OF THE * SYSTEM: THE KW+ 66

FORMAT + THE * DOUBLE-KWIC COORDINATE INDEX DISPLAY 55F

FORMS +O THE OCCURRENCE OF SINGULAR AND PLURAL WORD 80F

HYBRID INDEX +ATION OF THE * SYSTEM: THE KWOC-DKWIC 66

ILLUSTRATING SCATTERING DUE TO THE OCCURRENCE OF SI+ 80F

INDEX +ATION OF THE * SYSTEM: THE KWOC-DKWIC HYBRID 66

INDEX SYSTEM DESIGN FOR CREATING THE * DKWIC 64F

INDEX THE * DOUBLE-KWIC (DKWIC) COORDINATE 46

INDEX (DKWIC) ENTRIES +THE * DOUBLE-KWIC COORDINATE 54F

INDEX CAUSED BY PERMUTED SUBORDINATE +N THE * DKWIC 67F

INDEX CAUSED BY PERMUTING SUBORDINATE ENTRIES UNDER+ 66F

INDEX DISPLAY FORMAT + THE * DOUBLE-KWIC COORDINATE 55F

INDEX ILLUSTRATING SCATTERING DUE TO THE OCCURRENCE+ 80F

KWIC (DKWIC) COORDINATE INDEX THE * DOUBLE- 46

KWIC COORDINATE INDEX (DKWIC) ENTRIES +THE * DOUBLE 54F

KWIC COORDINATE INDEX DISPLAY FORMAT + THE * DOUBLE 55F

KWOC-DKWIC HYBRID INDEX +ATION OF THE * SYSTEM: THE 66

MAIN TERMS DERIVED FROM ONLY A SINGLE TITLE + UNDER 66F

MODIFICATION OF THE * SYSTEM: THE KWOC-DKWIC HYBRID+ 66

OCCURRENCE OF SINGULAR AND PLURAL WORD FORMS +O THE 80F

PERMUTED SUBORDINATE +N THE * DKWIC INDEX CAUSED BY 67F

PERMUTING SUBORDINATE ENTRIES UNDER MAIN TERMS DERI+ 66F

PLURAL WORD FORMS +O THE OCCURRENCE OF SINGULAR AND 80F

SCATTERING DUE TO THE OCCURRENCE OF SINGULAR AND PL+ 80F

SINGULAR AND PLURAL WORD FORMS +O THE OCCURRENCE OF 80F

SIZE BALLOONING EFFECT IN THE * DKWIC INDEX CAUSED + 66F

SIZE BALLOONING EFFECT IN THE * DKWIC INDEX CAUSED + 67F

STUTTERING EFFECT AND SIZE BALLOONING EFFECT IN THE+ 67F
SUBORDINATE +N THE * DKWIC INDEX CAUSED BY PERMUTED 67F
SUBORDINATE ENTRIES UNDER MAIN TERMS DERIVED FROM O+ 66F

* SYSTEM DESIGN 62 

SYSTEM DESIGN FOR CREATING THE * DKWIC INDEX 64F

* SYSTEM: THE KWOC-DKWIC HYBRID INDEX +ICATION OF THE 66
TERMS DERIVED FROM ONLY A SINGLE TITLE + UNDER MAIN 66F
TITLE + UNDER MAIN TERMS DERIVED FROM ONLY A SINGLE 66F
WORD FORMS +O THE OCCURRENCE OF SINGULAR AND PLURAL 80F

PROXIMITY RESTRICTIONS TO ASE SELECTION +ING SOME WORD 142F

PSE (POTENTIAL SUBORDINATE ENTRY) SETS +TING EXCLUSIVE 106

PSE COUNT MARKERS AUTOMATICALLY PRODUCED BY THE AMT SE+ 127F

PSE MARKERS PRODUCED BY THE AMT SELECTION ALGORITHM +E 126F

PSE SETS (P) AND EXCLUSIVE PSE SETS (Z) FOR ALL THE NO+ 107F
PSE SETS (Z) FOR ALL THE NODES +SETS (P) AND EXCLUSIVE 107F

RANDOMIZATION OF SECONDARY CONCEPTS FOR THE HIGH-DENSI+ 50F

RANDOMIZATION OF SECONDARY CONCEPTS FOR THE SAME TITLE+ 49F

RANDOMIZATION OF SECONDARY CONCEPTS FOUND FOR A HIGH-D+ 47F

RECODING ALGORITHM BY THE PLURAL-SINGULAR STEMMING- 87F

RECODING ALGORITHM PLURAL-SINGULAR STEMMING-

RECODING FOR PRINTED INDEXES STEMMING AND 83
RELATIONSHIP(S) BETWEEN INDEXING AND DOCUMENT RETRIEVA+ 7

REQUIREMENT(S) DATA BASE INTERFACE 197

REQUIREMENT(S) FOR THE DKWIC INDEXING OPERATIONS +FACE 95
REQUIREMENT(S) FOR THE SELECTION OF ACTUAL MAIN TERMS + 74

REQUIREMENT(S) OF AN INTERFACE SUBROUTINE 198

RESEARCH +ULTS, CONCLUSIONS, AND DIRECTIONS FOR FUTURE 132
RESEARCH AND POSSIBLE IMPROVEMENTS IN THE DKWIC INDEXI+ 139
RESTRICTION(S) +UTHORITY LIST SUBSYSTEM IMPLEMENTATION 197
RESTRICTION(S) +C DKWIC INDEX SUBSYSTEM IMPLEMENTATION 189
RESTRICTION(S) +C DKWIC INDEX SUBSYSTEM IMPLEMENTATION 179
RESTRICTION(S) TO ASE SELECTION +G SOME WORD PROXIMITY 142F
RESULT(S) OF APPLYING AN AUTOMATICALLY GENERATED AUTHO+ 88
RETRIEVAL +RELATIONSHIPS BETWEEN INDEXING AND DOCUMENT 7
ROTATED KEYWORD INDEX 21



SCATTERING DUE TO THE OCCURRENCE OF SINGULAR AND PLURA+ 80F

SCATTERING IN A DKWIC INDEX AS A RESULT OF APPLYING AN+ 88

SCATTERING IN A KWIC INDEX INFLECTIONAL 79F

SCATTERING THAT OCCURS IN DOUBLE-KWIC COORDINATE INDEX+ n7F